Systematic Review & Meta-Analysis

Evidence synthesis / HEOR
Evidence scientist / systematic reviewer

Given a registered PICO question, the agent screens a pool of PubMed/Embase-shaped study abstracts against explicit inclusion/exclusion criteria (PRISMA discipline: RCT-only, right population/intervention/comparator, outcome reported), then pools the reported effect sizes of the included studies into a single meta-analytic estimate. Pooling is inverse-variance on the log scale. Heterogeneity is assessed via Cochran's Q and I^2, and when I^2 > 50% the DerSimonian-Laird random-effects model is used. The agent has read-only study-query tools that return raw records, so it must do the screening and the pooling itself.
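The pooling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's reference grader: the function names, the return keys, and the example inputs are assumptions made here, and the inputs are invented numbers rather than studies from the snapshot. It derives each standard error from the reported 95% CI, computes Cochran's Q and I^2, and switches to DerSimonian-Laird weights only when I^2 exceeds 50%, following the rule stated on this page.

```python
import math


def se_from_ci(lo, hi, z=1.96):
    """Standard error of the log effect, recovered from a ratio-scale 95% CI."""
    return (math.log(hi) - math.log(lo)) / (2 * z)


def pool_log_ratios(studies, z=1.96):
    """Pool (estimate, ci_lo, ci_hi) tuples given on the ratio scale (OR or RR).

    Assumed rule, taken from the task description: fixed-effect
    inverse-variance unless I^2 > 50%, then DerSimonian-Laird random-effects.
    """
    if len(studies) < 2:
        raise ValueError("need at least two studies to assess heterogeneity")

    y = [math.log(est) for est, _, _ in studies]            # log effects
    v = [se_from_ci(lo, hi, z) ** 2 for _, lo, hi in studies]  # variances
    w = [1.0 / vi for vi in v]                               # fixed-effect weights
    sw = sum(w)

    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(studies) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # DerSimonian-Laird between-study variance, truncated at zero.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    model = "random" if i2 > 50 else "fixed"
    if model == "random":
        w = [1.0 / (vi + tau2) for vi in v]
        sw = sum(w)

    mu = sum(wi * yi for wi, yi in zip(w, y)) / sw
    se = math.sqrt(1.0 / sw)
    return {
        "model": model,
        "q": q,
        "i2": i2,
        "tau2": tau2,
        "pooled": math.exp(mu),
        "ci_lo": math.exp(mu - z * se),
        "ci_hi": math.exp(mu + z * se),
    }


if __name__ == "__main__":
    # Invented illustrative inputs, not records from the environment.
    example = [(0.80, 0.65, 0.98), (0.90, 0.70, 1.16), (0.60, 0.45, 0.80)]
    print(pool_log_ratios(example))
```

Because the pooled log effect is a weighted mean of the study log effects, the back-transformed estimate always lies between the smallest and largest included study estimates, which gives a cheap sanity check on any reported result.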

Why this is fundable

Scarce expert who grades this
Evidence scientist / systematic reviewer with biostatistics training (Cochrane-method meta-analyst, ~$150-300/hr loaded)
What one decision is worth
Meta-analyses are the top of the evidence pyramid: they drive clinical-practice guidelines, HTA/payer coverage and pricing decisions, and internal go/no-go on development programs. A wrong pooled estimate or a screening error can swing a guideline recommendation or a reimbursement decision worth hundreds of millions to billions, or greenlight a program the evidence doesn't support.
Real-world data sources
PubMed / Embase abstracts and trial reports (effect estimates + 95% CIs), Cochrane CENTRAL, ClinicalTrials.gov results, with Cochrane RoB 2 for risk of bias. Curated teaching snapshot here; refreshable from live bibliographic APIs.

Agent tools

list_review_questions
get_inclusion_criteria
search_studies
get_study

Expert grading rubric

| Dimension | 5 (excellent) | 1 (poor) |
| --- | --- | --- |
| Screening accuracy & PRISMA discipline | Applies the explicit PICO + RCT-only criteria correctly: includes exactly the eligible RCTs and excludes the rest, each with the correct concrete reason (wrong design, population, comparator, or outcome not reported), and reports coherent PRISMA identification/screening/eligibility/included counts. | Includes ineligible studies (e.g. the observational cohort, the VTE/valve populations, the placebo/aspirin or DOAC-vs-DOAC comparators, or the study that doesn't report the outcome), drops eligible RCTs, or gives no/garbled PRISMA flow. |
| Effect-measure & log-scale handling | Uses the correct measure (OR for efficacy, RR for safety), pools on the LOG scale, and derives each study's SE from its CI as (ln(hi)-ln(lo))/(2*1.96) rather than treating the point estimate or CI on the natural scale. | Pools raw (non-log) ratios, mishandles or invents the standard errors, mixes OR and RR, or pulls the wrong outcome's effect for a study. |
| Pooling-model choice & I^2 interpretation | Computes Cochran's Q and I^2, and chooses fixed vs random-effects coherently with the heterogeneity (I^2>50% => random-effects), naming the DerSimonian-Laird estimator and interpreting I^2 correctly. | Ignores heterogeneity, picks a fixed-effect model despite high I^2 (or vice versa with no rationale), or misreads what I^2 means. |
| Numerical correctness of the pooled estimate & CI | The pooled point estimate, 95% CI, and I^2 match the deterministic inverse-variance / DerSimonian-Laird computation within rounding, and the pooled estimate sits within the range of the included study estimates. | Pooled estimate or CI is materially wrong, falls outside the plausible range of inputs, has an inverted/implausible CI, or the arithmetic is unjustified. |
| Evidence faithfulness | Every study, effect estimate, and CI used traces to the actual tool outputs; no fabricated trials, effects, or CIs, and excluded studies' numbers are not smuggled into the pool. | Fabricates studies or effect sizes, alters the reported CIs, or pools effects from studies it claimed to exclude. |
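The screening dimension of the rubric can be illustrated with a small sketch that assigns one concrete exclusion reason per study and tallies PRISMA-style counts. Everything about the record layout here is an assumption: the field names (`design`, `population`, `intervention`, `comparator`, `outcomes`), the criteria dictionary, and the placeholder records are hypothetical and are not the actual output schema of `get_study` or `get_inclusion_criteria`.

```python
from collections import Counter


def exclusion_reason(record, criteria):
    """Return None if the record is eligible, else the first failing criterion.

    Field names are hypothetical; adapt them to the real tool output.
    """
    if record["design"] != "RCT":
        return "wrong design"
    if record["population"] != criteria["population"]:
        return "wrong population"
    if record["intervention"] != criteria["intervention"]:
        return "wrong intervention"
    if record["comparator"] != criteria["comparator"]:
        return "wrong comparator"
    if criteria["outcome"] not in record["outcomes"]:
        return "outcome not reported"
    return None


def prisma_flow(records, criteria):
    """Screen every record once and report counts that add up."""
    included, reasons = [], Counter()
    for record in records:
        reason = exclusion_reason(record, criteria)
        if reason is None:
            included.append(record["id"])
        else:
            reasons[reason] += 1
    return {
        "identified": len(records),
        "excluded": dict(reasons),
        "included": included,
    }


# Placeholder criteria and records, invented for illustration only.
criteria = {
    "population": "target population",
    "intervention": "drug A",
    "comparator": "drug B",
    "outcome": "primary outcome",
}
records = [
    {"id": "S1", "design": "RCT", "population": "target population",
     "intervention": "drug A", "comparator": "drug B",
     "outcomes": ["primary outcome"]},
    {"id": "S2", "design": "cohort", "population": "target population",
     "intervention": "drug A", "comparator": "drug B",
     "outcomes": ["primary outcome"]},
    {"id": "S3", "design": "RCT", "population": "target population",
     "intervention": "drug A", "comparator": "placebo",
     "outcomes": ["primary outcome"]},
    {"id": "S4", "design": "RCT", "population": "target population",
     "intervention": "drug A", "comparator": "drug B",
     "outcomes": ["secondary outcome"]},
]
flow = prisma_flow(records, criteria)
```

Keeping the included IDs as the only input to the pooling step is one way to make sure that numbers from excluded studies cannot reach the pooled estimate, which is what the evidence-faithfulness row asks for.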

Trajectories

model panel (compare side by side)

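| Model | Provider | Tier | Judge 1–5 | Verdict |
| --- | --- | --- | --- | --- |
| Claude Opus 4.8 | anthropic | frontier | 3.6 | flawed |
| GPT (frontier) | openai | frontier | 3.2 | flawed |
| Claude Haiku 4.5 | anthropic | small | 3.0 | flawed |
| GPT-4o mini | openai | small | 1.2 | unusable |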