← home

Expert-grade environments & data for biopharma agents — investment thesis

An honest assessment, written to survive a sharp biopharma VC, not to flatter the idea.

One line

We build the expert-graded RL environments and the labeled trajectory data that frontier labs and bio-AI companies need to make AI agents competent at high-value biopharma knowledge work — "Mercor for biopharma workflows," but productized as reusable environments, not just a labor marketplace.


Why this is a real, fundable category (not a feature)

  1. The training-data frontier has moved from web text to expert RL environments. Pre-training on scraped text is saturating. The marginal capability gain now comes from verifiable environments and expert feedback on agent trajectories — exactly the asset class here. Mercor (~$10B, 2025), Surge, Scale, Turing, and the "RL gym" wave (Prime Intellect, Mechanize) are all funded on this thesis. The category has demonstrated lab willingness to pay at scale.

  2. Biopharma is one of the highest willingness-to-pay knowledge-work domains on earth. A single target-prioritization, FTO, or trial-design decision routinely gates $50M–$5B. The experts who make these calls (MDs, PharmDs, patent attorneys, computational biologists) bill $200–800/hr and are scarce. Labs and pharma will pay a large premium for data that teaches an agent to do this work — because the downstream decision is worth orders of magnitude more than the label.

  3. These workflows decompose into gradable units. We catalogued ~35 distinct biopharma workflows across 7 functional domains (discovery → preclinical → clinical → regulatory → medical → commercial/BD → CMC). Each has a defined role, structured public data inputs (AACT, GTEx/TCGA, patents, FAERS, labels), and an output an expert can score 1–5. That gradability is what makes it a data product, not consulting.

  4. The deliverable is an appreciating asset, not labor hours. Each environment is reusable: it generates unlimited agent trajectories, and every expert grade becomes answer-key / reward signal. The library + accumulated grades compound into a proprietary dataset and, eventually, biopharma-specific reward models — the moat a pure marketplace lacks.


The wedge (why us, why now)


What we've built (the demo)

A generic environment framework + 10 fully working environments spanning the value chain, each with read-only data tools, a 5-dimension expert rubric, a deterministic answer key where one exists, and example queries — viewable and gradable in a web app:

Workflow (of ~35) Domain Role / scarce expert What one decision is worth
Target validation / tractability Discovery Target biologist most pipeline failures are wrong targets
First-in-human dose selection Clinical pharm Clinical pharmacologist TGN1412-style catastrophe vs failed trial
Competitive trial-landscape Clinical / CI Oncology CI analyst (MD/PharmD) $50M–$500M development bets
Eligibility-criteria authoring Clinical Medical monitor (MD) patient safety + 12mo / millions in enrollment
AE coding & causality (PV) Pharmacovigilance Drug-safety physician missed safety signals → withdrawals
Regulatory strategy Regulatory Reg-affairs lead / ex-FDA years off time-to-market
Systematic review / meta-analysis Evidence Evidence scientist / biostatistician drives guidelines, HTA, go/no-go
Epidemiology market sizing Commercial Forecasting analyst $100M–$5B licensing / M&A valuation
Patent / freedom-to-operate BD / IP IP attorney ($400–800/hr) kills or clears a whole asset
BD asset due diligence / rNPV Corp dev BD/licensing analyst anchors $100M–$5B deal prices

The agent sees only the query + tools — never the answer key — so trajectories are grounded in the environment, and the expert grades reasoning, not recall. Trajectories run on any frontier or weaker model, across providers (Claude and OpenAI), which is what makes the eval meaningful.

This is more than a viewer — it's the full data machine: - Benchmark / eval — every model in a panel (Claude Opus 4.8, Haiku 4.5, GPT-5.2, GPT-4o-mini) run on each environment and scored 1–5 by an LLM judge calibrated to the same expert rubric a human grades on. Produces a model × workflow leaderboard. Early result: the environments cleanly discriminate — on the CI workflow, Opus 4.8 scored 4.0, GPT-5.2 3.2, Haiku 3.0, and GPT-4o-mini 1.0 (it hallucinated "no trials exist"). That separation IS the product. - SFT dataset — messages-format demonstrations from trajectories that hit the expert bar. - Preference dataset (RLHF/DPO) — chosen-vs-rejected pairs on matched queries, for reward modeling. Human expert grades are the gold label and flow through the identical schema.

The product the VC sees: the environment library, real agent trajectories from multiple frontier/weaker models side by side, an expert (and a calibrated judge) grading them, a benchmark leaderboard, and SFT + preference datasets falling out the other end — the complete biopharma expert-reasoning data machine, end to end.


The hard questions a VC will ask — and the honest answers

Q: Is the data real, or is this synthetic? Today the environments run on curated teaching snapshots derived from real sources (real trials, targets, patents, drugs). That is the correct design for a reproducible RL environment — live data shifts under you and breaks grade comparability — but it's a credibility risk with a bio audience. Mitigation, shipped: the competitive-landscape environment includes a live ClinicalTrials.gov v2 refresh path proving the snapshot is real and refreshable; every other environment's data_sources maps each input to a citable real dataset. Roadmap: one-click live pulls per environment, behind a fixed-seed snapshot for reproducibility.

Q: Why won't Mercor / Scale / Surge just do this? They have the labeler marketplace; they lack the biopharma data infrastructure and the domain depth to author environments experts respect. Our wedge is vertical: we're the biopharma environment specialist, and we can supply them as well as labs directly.

Q: Where's the moat — isn't an environment copyable? The compounding asset is (1) the breadth of the environment library across the ~35 workflows, (2) the accumulated expert grades (proprietary reward data), and (3) the network of vetted biopharma experts. Any one environment is copyable; the library + graded corpus + expert bench is not, quickly.

Q: Who actually pays, and for what? Two buyers. Frontier labs buy environments + graded trajectories to train/eval agents on expert domains (Mercor's proven motion). Bio-AI & pharma buy evals to measure whether an agent is safe to deploy on a workflow, and fine-tuning data to make it competent. Pricing follows the category: per-environment build + per-graded-trajectory, or a data subscription.

Q: Can you get the experts? The scarcest input. The bet is that a vertical brand ("the place serious biopharma minds grade frontier AI on real drug-development problems") plus per-grade economics that respect their rate ($200–800/hr) recruits better in-domain than a horizontal marketplace. This is the #1 thing to de-risk next, and the honest top risk.


Top risks (stated plainly)

  1. Expert supply & quality — the whole thesis rests on recruiting and retaining credible graders. De-risk first.
  2. Buyer concentration — early revenue likely leans on a few frontier labs; their in-housing or budget shifts are existential. Counter by also selling evals to bio-AI/pharma.
  3. "Real enough" data — bio buyers are unforgiving about realism; the curated-snapshot story must visibly connect to live sources (in progress).
  4. Commoditization of environments — mitigated only by library breadth + graded-data moat + expert network, which must be built deliberately.

What to build next (post-seed)

  1. Recruit 3–5 marquee experts per domain; collect real grades to seed the proprietary corpus.
  2. Live-data pulls behind reproducible snapshots for every environment.
  3. Inter-grader agreement + a reward-model trained on collected grades (proves the data has signal).
  4. Expand from 10 → 20+ environments across the 7 domains; land one paid lab/bio-AI pilot.

Bottom line: the category is real and funded, biopharma is the highest-value vertical to own, and we have an unfair head start on the data infrastructure. The idea is sound. The execution risk that matters is expert supply — everything else is buildable, and the demo shows the machine works.