Semantic Search

The CTWise API uses AI-powered semantic search to match on the meaning of your queries, not just their keywords.


Overview

Traditional regulatory databases require exact keyword matches. CTWise embeds each query with Amazon Bedrock Titan Text Embeddings v2 and searches it with AWS S3 Vectors, so it can find relevant documents even when they share no words with the query.

Query: "informed consent pediatric"

Keyword Result: Only documents containing ALL of the exact words
Missed: "assent procedures for minors", "parental permission requirements"

The Semantic Search Advantage

Query: "What are the requirements for informed consent in pediatric trials?"

Semantic Result:
1. FDA-INFORMED-CONSENT-2024 (score: 0.56) - Informed consent guidance
2. ICH-E11(R1) (score: 0.50) - Pediatric population guidance
3. FDA-PEDIATRIC-2023 (score: 0.44) - Pediatric study plans

Why: The query embedding captures the MEANING of the question (consent + children + trials), not just its wording

How It Works

1. Query Embedding

Your natural language query is converted to a 1024-dimensional vector using Amazon Bedrock Titan:

Query: "What guidance exists for adaptive trial designs?"

└─► Titan Embed → [0.12, -0.45, 0.78, ...] (1024 dimensions)
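
The embedding step above can be sketched as follows. This is a minimal illustration, not CTWise's actual implementation: the helper names `build_embedding_request` and `embed_query` and the default region are assumptions, while the model ID and the 1024-dimension setting come from this page. The network call needs the AWS SDK (boto3) and valid credentials, so it is kept in a separate function.

```python
import json

# Model ID as reported in the response's query_metadata on this page.
MODEL_ID = "amazon.titan-embed-text-v2:0"


def build_embedding_request(text: str, dimensions: int = 1024) -> str:
    """Serialize a Titan Text Embeddings v2 request body (hypothetical helper)."""
    if not text.strip():
        raise ValueError("query text must not be empty")
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": True}
    )


def embed_query(text: str, region: str = "us-east-1") -> list:
    """Call Bedrock and return the embedding vector.

    Requires boto3 and AWS credentials; the region is an assumption.
    """
    import boto3  # imported lazily so the helper above works without the SDK

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_embedding_request(text),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```

Calling `embed_query("What guidance exists for adaptive trial designs?")` with credentials configured should return a list of 1024 floats.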

2. Vector Search

AWS S3 Vectors performs an approximate nearest neighbor search against the pre-indexed regulatory rules:

Query Vector → S3 Vectors Index

├─► FDA-ADAPTIVE-2019 → similarity: 0.7769
├─► ICH-E20 → similarity: 0.5411
└─► FDA-DMC-2024-DRAFT → similarity: 0.3955
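
To make the similarity scores concrete, here is a small sketch of cosine-similarity ranking. It is an exact brute-force search over toy 3-dimensional vectors invented for illustration; the real service uses an approximate nearest neighbor index over 1024-dimensional vectors, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def nearest_rules(query_vec, index, top_k=3):
    """Score every indexed rule and return the top_k, best first."""
    scored = [
        (rule_id, cosine_similarity(query_vec, vec))
        for rule_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration only.
toy_index = {
    "FDA-ADAPTIVE-2019": [0.9, 0.1, 0.0],
    "ICH-E20": [0.6, 0.6, 0.1],
    "FDA-DMC-2024-DRAFT": [0.2, 0.3, 0.9],
}

ranked = nearest_rules([1.0, 0.2, 0.0], toy_index)
for rule_id, score in ranked:
    print(f"{rule_id}: {score:.4f}")
```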

3. Ranked Results

Results are returned in descending order of semantic similarity, each with its similarity score:

{
  "results": [
    {
      "rule_id": "FDA-ADAPTIVE-2019",
      "title": "Adaptive Designs for Clinical Trials of Drugs and Biologics",
      "similarity_score": 0.7769,
      "source": "fda"
    }
  ]
}

Natural Language Query Examples

Regulatory Concept Queries

| Query | Top Result | Score |
|---|---|---|
| "What are the requirements for informed consent in pediatric trials?" | FDA-INFORMED-CONSENT-2024 | 0.56 |
| "How should I handle adverse event reporting?" | ICH-E2A | 0.44 |
| "What statistical methods are acceptable for phase 3?" | ICH-E9(R1) | 0.46 |
| "GCP guidelines for investigator responsibilities" | ICH-E6(R3) | 0.55 |

Process-Oriented Queries

| Query | Top Result | Score |
|---|---|---|
| "What guidance exists for adaptive trial designs?" | FDA-ADAPTIVE-2019 | 0.78 |
| "How do I establish a Data Safety Monitoring Board?" | FDA-DMC-2024-DRAFT | 0.58 |
| "Explain protocol amendment procedures" | ICH-E6(R3) | 0.47 |
| "What training is required for clinical investigators?" | ICH-E6(R2) | 0.45 |

Domain-Specific Queries

| Query | Top Result | Score |
|---|---|---|
| "Tell me about blinding requirements in controlled trials" | ICH-E10 | 0.43 |
| "What are the monitoring requirements for multi-site studies?" | ICH-E6(R3) | 0.52 |
| "How should biomarker data be collected and analyzed?" | ICH-E16 | 0.41 |

API Usage

Semantic Search Endpoint

POST Method (Recommended):

curl -X POST https://api.ctwise.ai/v1/semantic-search \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the requirements for informed consent in pediatric trials?",
    "sources": ["fda", "ich"],
    "top_k": 5,
    "min_similarity": 0.25
  }'

GET Method (Alternative via query parameters):

curl "https://api.ctwise.ai/v1/rules/search?q=informed+consent+pediatric&sources=fda,ich&limit=5" \
  -H "X-Api-Key: YOUR_API_KEY"
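
The POST request can also be issued from Python. The sketch below uses only the standard library and mirrors the curl example above; the function names `build_search_request` and `semantic_search` are hypothetical, not part of an official CTWise client, and the 10-second timeout is an arbitrary choice.

```python
import json
import urllib.request

API_URL = "https://api.ctwise.ai/v1/semantic-search"


def build_search_request(api_key, query, sources=None, top_k=5, min_similarity=0.25):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"query": query, "top_k": top_k, "min_similarity": min_similarity}
    if sources:
        payload["sources"] = sources  # optional filter, omitted when empty
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def semantic_search(api_key, query, **options):
    """Send the request and decode the JSON response (performs a network call)."""
    request = build_search_request(api_key, query, **options)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect or log before any network traffic occurs.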

Request Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| query | string | Yes | Natural language question |
| sources | string[] | No | Filter by source (fda, ich, ema, who) |
| top_k | integer | No | Number of results (default: 5, max: 50) |
| min_similarity | float | No | Minimum similarity threshold (default: 0.25) |
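
A client may want to check parameters before sending a request. The sketch below encodes the constraints listed in the table; the lower bound of 1 for `top_k` and the 0.0 to 1.0 range for `min_similarity` are assumptions, since this page documents only the defaults and the `top_k` maximum. The function name is hypothetical.

```python
ALLOWED_SOURCES = {"fda", "ich", "ema", "who"}


def validate_search_params(query, sources=None, top_k=5, min_similarity=0.25):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not isinstance(query, str) or not query.strip():
        errors.append("query is required and must be a non-empty string")
    unknown = set(sources or []) - ALLOWED_SOURCES
    if unknown:
        errors.append(f"unknown sources: {sorted(unknown)}")
    # Lower bound of 1 is an assumption; the documented maximum is 50.
    if not isinstance(top_k, int) or not 1 <= top_k <= 50:
        errors.append("top_k must be an integer between 1 and 50")
    # The 0.0-1.0 range is an assumption based on the scores shown on this page.
    if not 0.0 <= min_similarity <= 1.0:
        errors.append("min_similarity must be between 0.0 and 1.0")
    return errors
```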

Response

{
  "query": "What are the requirements for informed consent in pediatric trials?",
  "results": [
    {
      "rule_id": "FDA-INFORMED-CONSENT-2024",
      "title": "Informed Consent: Guidance for IRBs, Clinical Investigators, and Sponsors",
      "source": "fda",
      "similarity_score": 0.5594,
      "effective_date": "2024-01-01"
    },
    {
      "rule_id": "ICH-E11(R1)",
      "title": "Clinical Investigation of Medicinal Products in the Pediatric Population",
      "source": "ich",
      "similarity_score": 0.5022,
      "effective_date": "2017-09-14"
    }
  ],
  "query_metadata": {
    "execution_time_ms": 380,
    "embedding_model": "amazon.titan-embed-text-v2:0",
    "indexes_searched": ["fda-tier1", "ich-tier1"],
    "total_results": 5
  }
}
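
As a rough illustration of consuming this response, the sketch below turns the `results` array into one summary line per rule. The sample dictionary copies a subset of the fields from the response above; `summarize_results` is a hypothetical helper name.

```python
# Subset of the sample response shown above.
sample_response = {
    "results": [
        {
            "rule_id": "FDA-INFORMED-CONSENT-2024",
            "source": "fda",
            "similarity_score": 0.5594,
        },
        {
            "rule_id": "ICH-E11(R1)",
            "source": "ich",
            "similarity_score": 0.5022,
        },
    ],
    "query_metadata": {"execution_time_ms": 380},
}


def summarize_results(response):
    """Format each result as 'rank. rule_id (source, score X.XX)'."""
    lines = []
    for rank, result in enumerate(response.get("results", []), start=1):
        lines.append(
            f"{rank}. {result['rule_id']} "
            f"({result['source']}, score {result['similarity_score']:.2f})"
        )
    return lines


for line in summarize_results(sample_response):
    print(line)
# 1. FDA-INFORMED-CONSENT-2024 (fda, score 0.56)
# 2. ICH-E11(R1) (ich, score 0.50)
```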

Similarity Scoring

Score Interpretation

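| Score Range | Meaning | Recommendation |
|---|---|---|
| 0.70+ | High confidence match | Directly relevant |
| 0.50-0.70 | Good match | Review for relevance |
| 0.25-0.50 | Partial match | May be related |
| < 0.25 | Below default threshold | Not returned unless min_similarity is lowered |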
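
The bands in the table can be applied in client code with a small helper like the one below. The function name is hypothetical, and because the table's ranges share their endpoints, assigning exactly 0.50 and 0.70 to the higher band is an assumption.

```python
def interpret_score(score):
    """Map a similarity score to the meaning given in the table above."""
    if score >= 0.70:
        return "High confidence match"
    if score >= 0.50:  # boundary values go to the higher band (assumption)
        return "Good match"
    if score >= 0.25:
        return "Partial match"
    return "Below threshold"


for value in (0.7769, 0.5594, 0.3955, 0.10):
    print(value, "->", interpret_score(value))
```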

Configuring Thresholds

For different use cases, adjust the min_similarity parameter:

| Use Case | Threshold | Rationale |
|---|---|---|
| Broad discovery | 0.20 | Find loosely related rules |
| Standard search | 0.25 | Balanced precision/recall |
| Precise matching | 0.40 | High-confidence matches only |
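
One way to keep these thresholds consistent across a codebase is to name them as presets. The preset names and the `payload_for` helper below are hypothetical; the threshold values are the ones in the table.

```python
# Threshold values from the table above; the preset names are invented here.
THRESHOLD_PRESETS = {
    "broad": 0.20,     # broad discovery
    "standard": 0.25,  # standard search (API default)
    "precise": 0.40,   # precise matching
}


def payload_for(query, use_case="standard", top_k=5):
    """Build a request body using the threshold for the given use case."""
    if use_case not in THRESHOLD_PRESETS:
        raise ValueError(f"unknown use case: {use_case!r}")
    return {
        "query": query,
        "top_k": top_k,
        "min_similarity": THRESHOLD_PRESETS[use_case],
    }


print(payload_for("adaptive trial design requirements", use_case="precise"))
```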

Cross-Source Discovery

Semantic search excels at finding related rules across different regulatory authorities:

A single query about "informed consent requirements" returns:

FDA Results:
├── FDA-INFORMED-CONSENT-2024 (0.56)
└── FDA-PEDIATRIC-2023 (0.44)

ICH Results:
├── ICH-E11(R1) (0.50) - Pediatric
├── ICH-E6(R3) (0.39) - GCP
└── ICH-E8(R1) (0.29) - General Considerations

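Why this matters: With traditional keyword search, you would have to query each regulatory body's database separately, often with different terminology. Because semantic search matches on the concept rather than the wording, one query surfaces related rules from every source you have access to.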
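
A per-source view like the one above can be rebuilt on the client from the flat `results` array. The sketch below groups results by their `source` field; the sample data repeats the rule IDs and rounded scores from the listing above, and `group_by_source` is a hypothetical helper name.

```python
from collections import defaultdict

# Rule IDs and rounded scores from the cross-source listing above.
sample_results = [
    {"rule_id": "FDA-INFORMED-CONSENT-2024", "source": "fda", "similarity_score": 0.56},
    {"rule_id": "ICH-E11(R1)", "source": "ich", "similarity_score": 0.50},
    {"rule_id": "FDA-PEDIATRIC-2023", "source": "fda", "similarity_score": 0.44},
    {"rule_id": "ICH-E6(R3)", "source": "ich", "similarity_score": 0.39},
    {"rule_id": "ICH-E8(R1)", "source": "ich", "similarity_score": 0.29},
]


def group_by_source(results):
    """Group results by source, keeping the best score first within each group."""
    grouped = defaultdict(list)
    ordered = sorted(results, key=lambda r: r["similarity_score"], reverse=True)
    for result in ordered:
        grouped[result["source"]].append(
            (result["rule_id"], result["similarity_score"])
        )
    return dict(grouped)


for source, rules in group_by_source(sample_results).items():
    print(source.upper(), rules)
```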


Performance Characteristics

| Metric | Value | Notes |
|---|---|---|
| Average response time | 380 ms | Including embedding generation |
| P95 response time | Less than 600 ms | Under load |
| Embedding model | Titan Text v2 | 1024 dimensions |
| Vector database | AWS S3 Vectors | Native AWS integration |
| Indexes available | FDA, ICH, EMA, WHO | Tier-dependent access |

Best Practices

1. Ask Complete Questions

Good: "What are the requirements for informed consent in pediatric trials?"
Poor: "informed consent pediatric"

2. Include Context

Good: "What statistical methods are acceptable for phase 3 oncology trials?"
Poor: "statistics trials"

3. Use Natural Language

Good: "How should adverse events be reported to the FDA?"
Poor: "AE reporting FDA"

4. Specify Domains When Known

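# If you know you want FDA guidance specifically.
# `search` stands for your own client wrapper around the semantic search endpoint.
response = search(
    query="adaptive trial design requirements",
    sources=["fda"],  # Limits search scope to the FDA index
    top_k=10,
)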

Technology Stack

| Component | Technology | Purpose |
|---|---|---|
| Embedding Model | Amazon Bedrock Titan Text Embeddings v2 | 1024-dimensional semantic encoding |
| Vector Database | AWS S3 Vectors | Cosine similarity search |
| Query Processing | AWS Lambda (ARM64) | Cost-optimized inference |
| Indexes | FDA-tier1, ICH-tier1, EMA-tier1, WHO-tier1 | Pre-computed regulatory rule vectors |

Verified Performance (2025-12-18)

| Metric | Result |
|---|---|
| Tests executed | 20 natural language queries |
| Success rate | 95% (19/20 returned results) |
| Highest score | 0.7769 ("adaptive trial designs") |
| Average score | 0.41 |
| Average response | 380 ms |

Evidence: See /aws_mp_set_up/products/ctwise/nlp_evidence/NLP_EVIDENCE_SUMMARY.md