Semantic Search

The CTWise API uses AI-powered semantic search to match on the meaning of your queries, not just their keywords.


Overview

Traditional regulatory databases require exact keyword matches. CTWise uses Amazon Bedrock Titan Text Embeddings v2 and AWS S3 Vectors to understand what you're actually looking for.

Query: "informed consent pediatric"

Keyword Result: Only documents containing ALL of the exact words
Missed: "assent procedures for minors", "parental permission requirements"

The Semantic Search Advantage

Query: "What are the requirements for informed consent in pediatric trials?"

Semantic Result:
1. FDA-INFORMED-CONSENT-2024 (score: 0.56) - Informed consent guidance
2. ICH-E11(R1) (score: 0.50) - Pediatric population guidance
3. FDA-PEDIATRIC-2023 (score: 0.44) - Pediatric study plans

Why: The model recognizes that the MEANING of the query relates to consent + children + trials

How It Works

1. Query Embedding

Your natural language query is converted to a 1024-dimensional vector using Amazon Bedrock Titan:

Query: "What guidance exists for adaptive trial designs?"
│
└─► Titan Embed → [0.12, -0.45, 0.78, ...] (1024 dimensions)
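As a rough sketch of this step, the function below requests a 1024-dimensional embedding through the Bedrock runtime `invoke_model` call. It assumes the caller passes in a boto3 `bedrock-runtime` client. The helper name `embed_query` and the `normalize` setting are illustrative assumptions, not a description of how CTWise implements the step internally.

```python
import json

# Model ID as reported in the API's query_metadata field.
MODEL_ID = "amazon.titan-embed-text-v2:0"
DIMENSIONS = 1024


def embed_query(client, text):
    """Return the embedding vector for `text`.

    `client` is assumed to be a boto3 "bedrock-runtime" client, for example
    boto3.client("bedrock-runtime"). Hypothetical helper for illustration.
    """
    body = json.dumps({
        "inputText": text,
        "dimensions": DIMENSIONS,
        "normalize": True,  # assumption: unit-length vectors suit cosine search
    })
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream containing a JSON document.
    payload = json.loads(response["body"].read())
    vector = payload["embedding"]
    if len(vector) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} dimensions, got {len(vector)}")
    return vector
```

Passing the client in as an argument keeps the function easy to exercise with a stub, so it can be tried without AWS credentials.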

2. Vector Search

AWS S3 Vectors performs approximate nearest neighbor search against pre-indexed regulatory rules:

Query Vector → S3 Vectors Index
│
├─► FDA-ADAPTIVE-2019 → similarity: 0.7769
├─► ICH-E20 → similarity: 0.5411
└─► FDA-DMC-2024-DRAFT → similarity: 0.3955
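The search itself runs inside the managed vector store. To illustrate only the ranking idea, the sketch below performs an exact, brute-force cosine-similarity search over a tiny in-memory index. Exact search stands in here for the approximate nearest neighbor search the service performs, and the three-dimensional vectors and `rule-*` names are made-up toy data, not real index contents.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=5, min_similarity=0.25):
    """Score every indexed vector, drop weak matches, return the best k."""
    scored = [
        (rule_id, cosine_similarity(query_vector, vector))
        for rule_id, vector in index.items()
    ]
    kept = [pair for pair in scored if pair[1] >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]


# Toy index: hypothetical rule names with 3-dimensional vectors.
index = {
    "rule-a": [1.0, 0.0, 0.0],
    "rule-b": [0.6, 0.8, 0.0],
    "rule-c": [0.0, 0.0, 1.0],
}
ranked = top_k([1.0, 0.0, 0.0], index, k=2)
for rule_id, score in ranked:
    print(f"{rule_id}: {score:.4f}")
```

Here `rule-c` is orthogonal to the query, so it falls below `min_similarity` and is dropped before ranking.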

3. Ranked Results

Results are returned sorted by descending semantic similarity, each with its similarity score:

{
  "results": [
    {
      "rule_id": "FDA-ADAPTIVE-2019",
      "title": "Adaptive Designs for Clinical Trials of Drugs and Biologics",
      "similarity_score": 0.7769,
      "source": "fda"
    }
  ]
}

Natural Language Query Examples

Regulatory Concept Queries

| Query | Top Result | Score |
| --- | --- | --- |
| "What are the requirements for informed consent in pediatric trials?" | FDA-INFORMED-CONSENT-2024 | 0.56 |
| "How should I handle adverse event reporting?" | ICH-E2A | 0.44 |
| "What statistical methods are acceptable for phase 3?" | ICH-E9(R1) | 0.46 |
| "GCP guidelines for investigator responsibilities" | ICH-E6(R3) | 0.55 |

Process-Oriented Queries

| Query | Top Result | Score |
| --- | --- | --- |
| "What guidance exists for adaptive trial designs?" | FDA-ADAPTIVE-2019 | 0.78 |
| "How do I establish a Data Safety Monitoring Board?" | FDA-DMC-2024-DRAFT | 0.58 |
| "Explain protocol amendment procedures" | ICH-E6(R3) | 0.47 |
| "What training is required for clinical investigators?" | ICH-E6(R2) | 0.45 |

Domain-Specific Queries

| Query | Top Result | Score |
| --- | --- | --- |
| "Tell me about blinding requirements in controlled trials" | ICH-E10 | 0.43 |
| "What are the monitoring requirements for multi-site studies?" | ICH-E6(R3) | 0.52 |
| "How should biomarker data be collected and analyzed?" | ICH-E16 | 0.41 |

API Usage

Semantic Search Endpoint

POST Method (Recommended):

curl -X POST https://api.ctwise.ai/v1/semantic-search \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the requirements for informed consent in pediatric trials?",
    "sources": ["fda", "ich"],
    "top_k": 5,
    "min_similarity": 0.25
  }'
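The same POST request can be issued from Python. The following is a minimal sketch using only the standard library; the function names `build_search_request` and `semantic_search` and the 10-second timeout are assumptions made for illustration, not part of an official CTWise client.

```python
import json
import urllib.request

API_URL = "https://api.ctwise.ai/v1/semantic-search"


def build_search_request(api_key, query, sources=None, top_k=5, min_similarity=0.25):
    """Build (but do not send) the POST request for a semantic search."""
    payload = {"query": query, "top_k": top_k, "min_similarity": min_similarity}
    if sources:
        payload["sources"] = list(sources)
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def semantic_search(api_key, query, **options):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(api_key, query, **options)
    # Timeout value is an arbitrary choice for this sketch.
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Separating request construction from sending makes it possible to inspect the headers and body before any network call is made.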

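GET Method (Alternative via query parameters). Note that this call uses the /v1/rules/search path and takes q and limit in place of query and top_k:

curl "https://api.ctwise.ai/v1/rules/search?q=informed+consent+pediatric&sources=fda,ich&limit=5" \
  -H "x-api-key: YOUR_API_KEY"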
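For the GET form, the query string has to be URL-encoded. The sketch below builds the same URL as the curl command with the standard library; `build_rules_search_url` is a hypothetical helper name, and joining sources with commas simply mirrors the curl example.

```python
from urllib.parse import urlencode

RULES_SEARCH_URL = "https://api.ctwise.ai/v1/rules/search"


def build_rules_search_url(q, sources=None, limit=5):
    """Return the GET URL for a keyword-style rules search."""
    params = {"q": q}
    if sources:
        # Sources are sent as one comma-separated value, as in the curl example.
        params["sources"] = ",".join(sources)
    params["limit"] = limit
    # safe="," keeps the comma literal; spaces are encoded as "+".
    return f"{RULES_SEARCH_URL}?{urlencode(params, safe=',')}"


url = build_rules_search_url("informed consent pediatric", ["fda", "ich"], limit=5)
print(url)
```

The API key still travels in the `x-api-key` header, so it never appears in the URL.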

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | Yes | Natural language question |
| sources | string[] | No | Filter by source (fda, ich, ema, who) |
| top_k | integer | No | Number of results (default: 5, max: 50) |
| min_similarity | float | No | Minimum similarity threshold (default: 0.25) |
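Checking parameters on the client side can catch mistakes before a request is sent. The sketch below encodes the documented constraints; the lower bound of 1 for `top_k` and the 0 to 1 range for `min_similarity` are assumptions, since the documentation states only the defaults and the maximum of 50. `validate_search_params` is a hypothetical helper.

```python
ALLOWED_SOURCES = {"fda", "ich", "ema", "who"}
MAX_TOP_K = 50


def validate_search_params(query, sources=None, top_k=5, min_similarity=0.25):
    """Validate the request parameters and return a JSON-ready payload."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query is required and must be a non-empty string")
    if sources is not None:
        unknown = set(sources) - ALLOWED_SOURCES
        if unknown:
            raise ValueError(f"unknown sources: {sorted(unknown)}")
    # Assumption: at least one result must be requested.
    if not isinstance(top_k, int) or not 1 <= top_k <= MAX_TOP_K:
        raise ValueError(f"top_k must be an integer between 1 and {MAX_TOP_K}")
    # Assumption: similarity thresholds are expressed on a 0 to 1 scale.
    if not 0.0 <= float(min_similarity) <= 1.0:
        raise ValueError("min_similarity must be between 0 and 1")
    payload = {"query": query, "top_k": top_k, "min_similarity": float(min_similarity)}
    if sources is not None:
        payload["sources"] = list(sources)
    return payload
```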

Response

{
  "query": "What are the requirements for informed consent in pediatric trials?",
  "results": [
    {
      "rule_id": "FDA-INFORMED-CONSENT-2024",
      "title": "Informed Consent: Guidance for IRBs, Clinical Investigators, and Sponsors",
      "source": "fda",
      "similarity_score": 0.5594,
      "effective_date": "2024-01-01"
    },
    {
      "rule_id": "ICH-E11(R1)",
      "title": "Clinical Investigation of Medicinal Products in the Pediatric Population",
      "source": "ich",
      "similarity_score": 0.5022,
      "effective_date": "2017-09-14"
    }
  ],
  "query_metadata": {
    "execution_time_ms": 380,
    "embedding_model": "amazon.titan-embed-text-v2:0",
    "indexes_searched": ["fda-tier1", "ich-tier1"],
    "total_results": 5
  }
}
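A response of this shape can be turned into typed objects for downstream code. The sketch below is one possible approach; the `RuleMatch` class and `parse_results` function are illustrative names, and `effective_date` is treated as optional because the shorter example earlier in this guide omits it.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleMatch:
    rule_id: str
    title: str
    source: str
    similarity_score: float
    effective_date: str  # empty string when the field is absent


def parse_results(response_text):
    """Parse a semantic-search response body into RuleMatch objects."""
    data = json.loads(response_text)
    return [
        RuleMatch(
            rule_id=item["rule_id"],
            title=item["title"],
            source=item["source"],
            similarity_score=float(item["similarity_score"]),
            effective_date=item.get("effective_date", ""),
        )
        for item in data.get("results", [])
    ]


# Abbreviated copy of the example response, used as sample input.
sample = """
{"query": "What are the requirements for informed consent in pediatric trials?",
 "results": [
   {"rule_id": "FDA-INFORMED-CONSENT-2024",
    "title": "Informed Consent: Guidance for IRBs, Clinical Investigators, and Sponsors",
    "source": "fda", "similarity_score": 0.5594, "effective_date": "2024-01-01"},
   {"rule_id": "ICH-E11(R1)",
    "title": "Clinical Investigation of Medicinal Products in the Pediatric Population",
    "source": "ich", "similarity_score": 0.5022, "effective_date": "2017-09-14"}
 ]}
"""
matches = parse_results(sample)
for match in matches:
    print(match.rule_id, match.similarity_score)
```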

Similarity Scoring

Score Interpretation

| Score Range | Meaning | Recommendation |
| --- | --- | --- |
| 0.70+ | High confidence match | Directly relevant |
| 0.50-0.70 | Good match | Review for relevance |
| 0.25-0.50 | Partial match | May be related |
| < 0.25 | Below threshold | Not returned at the default min_similarity |
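The bands above translate directly into a small lookup. In this sketch, a score that sits exactly on a boundary (for example 0.50) is assigned to the higher band; that tie-breaking rule is an assumption, since the table does not say which band owns the boundary. `interpret_score` is a hypothetical helper.

```python
def interpret_score(score):
    """Map a similarity score to the label used in the interpretation table."""
    if score >= 0.70:
        return "High confidence match"
    if score >= 0.50:
        return "Good match"
    if score >= 0.25:
        return "Partial match"
    return "Below threshold"


# Scores taken from examples elsewhere in this guide.
for score in (0.7769, 0.5594, 0.44):
    print(score, "->", interpret_score(score))
```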

Configuring Thresholds

For different use cases, adjust the min_similarity parameter:

| Use Case | Threshold | Rationale |
| --- | --- | --- |
| Broad discovery | 0.20 | Find loosely related rules |
| Standard search | 0.25 | Balanced precision/recall |
| Precise matching | 0.40 | Higher-confidence matches only |
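One way to keep these thresholds consistent across an application is to name them once and apply them to the request payload. The preset keys and the `with_threshold` helper below are illustrative names, not API parameters; only `min_similarity` is sent to the service.

```python
# Thresholds from the table above, keyed by hypothetical preset names.
THRESHOLD_PRESETS = {
    "broad_discovery": 0.20,
    "standard_search": 0.25,
    "precise_matching": 0.40,
}


def with_threshold(payload, use_case):
    """Return a copy of `payload` with min_similarity set for the use case."""
    if use_case not in THRESHOLD_PRESETS:
        raise KeyError(f"unknown use case: {use_case}")
    return {**payload, "min_similarity": THRESHOLD_PRESETS[use_case]}


request_body = with_threshold(
    {"query": "What guidance exists for adaptive trial designs?", "top_k": 5},
    "precise_matching",
)
print(request_body)
```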

Cross-Source Discovery

Semantic search excels at finding related rules across different regulatory authorities:

A single query about "informed consent requirements" returns:

FDA Results:
├── FDA-INFORMED-CONSENT-2024 (0.56)
└── FDA-PEDIATRIC-2023 (0.44)

ICH Results:
├── ICH-E11(R1) (0.50) - Pediatric
├── ICH-E6(R3) (0.39) - GCP
└── ICH-E8(R1) (0.29) - General Considerations

Why this matters: Traditional keyword search would require a separate query for each regulatory body. Semantic search recognizes that the concept spans multiple sources and returns them together.
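A combined result list can be regrouped by authority on the client side to reproduce the per-source view shown above. The sketch below uses the rule IDs and rounded scores from that example; the `source` value for each entry is inferred from the grouping in the example, and `group_by_source` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_source(results):
    """Group results by source, best match first within each group."""
    grouped = defaultdict(list)
    ordered = sorted(results, key=lambda r: r["similarity_score"], reverse=True)
    for item in ordered:
        grouped[item["source"]].append(item)
    return dict(grouped)


results = [
    {"rule_id": "ICH-E6(R3)", "source": "ich", "similarity_score": 0.39},
    {"rule_id": "FDA-INFORMED-CONSENT-2024", "source": "fda", "similarity_score": 0.56},
    {"rule_id": "ICH-E8(R1)", "source": "ich", "similarity_score": 0.29},
    {"rule_id": "ICH-E11(R1)", "source": "ich", "similarity_score": 0.50},
    {"rule_id": "FDA-PEDIATRIC-2023", "source": "fda", "similarity_score": 0.44},
]

for source, items in group_by_source(results).items():
    print(source.upper(), [item["rule_id"] for item in items])
```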


Performance Characteristics

| Metric | Value | Notes |
| --- | --- | --- |
| Average response time | 380 ms | Including embedding generation |
| P95 response time | Less than 600 ms | Under load |
| Embedding model | Titan Text v2 | 1024 dimensions |
| Vector database | AWS S3 Vectors | Native AWS integration |
| Indexes available | FDA, ICH, EMA, WHO | Tier-dependent access |

Best Practices

1. Ask Complete Questions

Good: "What are the requirements for informed consent in pediatric trials?"
Poor: "informed consent pediatric"

2. Include Context

Good: "What statistical methods are acceptable for phase 3 oncology trials?"
Poor: "statistics trials"

3. Use Natural Language

Good: "How should adverse events be reported to the FDA?"
Poor: "AE reporting FDA"

4. Specify Domains When Known

# If you know you want FDA guidance specifically.
# `search` is a placeholder for your own client wrapper around the
# semantic-search endpoint.
response = search(
    query="adaptive trial design requirements",
    sources=["fda"],  # Limits search scope
    top_k=10,
)

Technology Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| Embedding Model | Amazon Bedrock Titan Text Embeddings v2 | 1024-dimensional semantic encoding |
| Vector Database | AWS S3 Vectors | Cosine similarity search |
| Query Processing | AWS Lambda (ARM64) | Cost-optimized inference |
| Indexes | FDA-tier1, ICH-tier1, EMA-tier1, WHO-tier1 | Pre-computed regulatory rule vectors |

Verified Performance (2025-12-18)

| Metric | Result |
| --- | --- |
| Tests executed | 20 natural language queries |
| Success rate | 95% (19/20 returned results) |
| Highest score | 0.7769 ("adaptive trial designs") |
| Average score | 0.41 |
| Average response | 380 ms |

Evidence: See /aws_mp_set_up/products/ctwise/nlp_evidence/NLP_EVIDENCE_SUMMARY.md