An honest assessment, written to survive a sharp biopharma VC, not to flatter the idea.
We build the expert-graded RL environments and the labeled trajectory data that frontier labs and bio-AI companies need to make AI agents competent at high-value biopharma knowledge work — "Mercor for biopharma workflows," but productized as reusable environments, not just a labor marketplace.
The training-data frontier has moved from web text to expert RL environments. Pre-training on scraped text is saturating. The marginal capability gain now comes from verifiable environments and expert feedback on agent trajectories — exactly the asset class here. Mercor (~$10B, 2025), Surge, Scale, Turing, and the "RL gym" wave (Prime Intellect, Mechanize) are all funded on this thesis. The category has demonstrated lab willingness to pay at scale.
Biopharma is one of the highest willingness-to-pay knowledge-work domains on earth. A single target-prioritization, FTO, or trial-design decision routinely gates $50M–$5B. The experts who make these calls (MDs, PharmDs, patent attorneys, computational biologists) bill $200–800/hr and are scarce. Labs and pharma will pay a large premium for data that teaches an agent to do this work — because the downstream decision is worth orders of magnitude more than the label.
These workflows decompose into gradable units. We catalogued ~35 distinct biopharma workflows across 7 functional domains (discovery → preclinical → clinical → regulatory → medical → commercial/BD → CMC). Each has a defined role, structured public data inputs (AACT, GTEx/TCGA, patents, FAERS, labels), and an output an expert can score 1–5. That gradability is what makes it a data product, not consulting.
The deliverable is an appreciating asset, not labor hours. Each environment is reusable: it generates unlimited agent trajectories, and every expert grade becomes answer-key / reward signal. The library + accumulated grades compound into a proprietary dataset and, eventually, biopharma-specific reward models — the moat a pure marketplace lacks.
A generic environment framework + 10 fully working environments spanning the value chain, each with read-only data tools, a 5-dimension expert rubric, a deterministic answer key where one exists, and example queries — viewable and gradable in a web app:
| Workflow (of ~35) | Domain | Role / scarce expert | What one decision is worth |
|---|---|---|---|
| Target validation / tractability | Discovery | Target biologist | most pipeline failures are wrong targets |
| First-in-human dose selection | Clinical pharm | Clinical pharmacologist | TGN1412-style catastrophe vs failed trial |
| Competitive trial-landscape | Clinical / CI | Oncology CI analyst (MD/PharmD) | $50M–$500M development bets |
| Eligibility-criteria authoring | Clinical | Medical monitor (MD) | patient safety + 12mo / millions in enrollment |
| AE coding & causality (PV) | Pharmacovigilance | Drug-safety physician | missed safety signals → withdrawals |
| Regulatory strategy | Regulatory | Reg-affairs lead / ex-FDA | years off time-to-market |
| Systematic review / meta-analysis | Evidence | Evidence scientist / biostatistician | drives guidelines, HTA, go/no-go |
| Epidemiology market sizing | Commercial | Forecasting analyst | $100M–$5B licensing / M&A valuation |
| Patent / freedom-to-operate | BD / IP | IP attorney ($400–800/hr) | kills or clears a whole asset |
| BD asset due diligence / rNPV | Corp dev | BD/licensing analyst | anchors $100M–$5B deal prices |
The agent sees only the query + tools — never the answer key — so trajectories are grounded in the environment, and the expert grades reasoning, not recall. Trajectories run on any frontier or weaker model, across providers (Claude and OpenAI), which is what makes the eval meaningful.
This is more than a viewer — it's the full data machine: - Benchmark / eval — every model in a panel (Claude Opus 4.8, Haiku 4.5, GPT-5.2, GPT-4o-mini) run on each environment and scored 1–5 by an LLM judge calibrated to the same expert rubric a human grades on. Produces a model × workflow leaderboard. Early result: the environments cleanly discriminate — on the CI workflow, Opus 4.8 scored 4.0, GPT-5.2 3.2, Haiku 3.0, and GPT-4o-mini 1.0 (it hallucinated "no trials exist"). That separation IS the product. - SFT dataset — messages-format demonstrations from trajectories that hit the expert bar. - Preference dataset (RLHF/DPO) — chosen-vs-rejected pairs on matched queries, for reward modeling. Human expert grades are the gold label and flow through the identical schema.
The product the VC sees: the environment library, real agent trajectories from multiple frontier/weaker models side by side, an expert (and a calibrated judge) grading them, a benchmark leaderboard, and SFT + preference datasets falling out the other end — the complete biopharma expert-reasoning data machine, end to end.
Q: Is the data real, or is this synthetic?
Today the environments run on curated teaching snapshots derived from real sources (real trials,
targets, patents, drugs). That is the correct design for a reproducible RL environment — live
data shifts under you and breaks grade comparability — but it's a credibility risk with a bio
audience. Mitigation, shipped: the competitive-landscape environment includes a live
ClinicalTrials.gov v2 refresh path proving the snapshot is real and refreshable; every other
environment's data_sources maps each input to a citable real dataset. Roadmap: one-click live
pulls per environment, behind a fixed-seed snapshot for reproducibility.
Q: Why won't Mercor / Scale / Surge just do this? They have the labeler marketplace; they lack the biopharma data infrastructure and the domain depth to author environments experts respect. Our wedge is vertical: we're the biopharma environment specialist, and we can supply them as well as labs directly.
Q: Where's the moat — isn't an environment copyable? The compounding asset is (1) the breadth of the environment library across the ~35 workflows, (2) the accumulated expert grades (proprietary reward data), and (3) the network of vetted biopharma experts. Any one environment is copyable; the library + graded corpus + expert bench is not, quickly.
Q: Who actually pays, and for what? Two buyers. Frontier labs buy environments + graded trajectories to train/eval agents on expert domains (Mercor's proven motion). Bio-AI & pharma buy evals to measure whether an agent is safe to deploy on a workflow, and fine-tuning data to make it competent. Pricing follows the category: per-environment build + per-graded-trajectory, or a data subscription.
Q: Can you get the experts? The scarcest input. The bet is that a vertical brand ("the place serious biopharma minds grade frontier AI on real drug-development problems") plus per-grade economics that respect their rate ($200–800/hr) recruits better in-domain than a horizontal marketplace. This is the #1 thing to de-risk next, and the honest top risk.
Bottom line: the category is real and funded, biopharma is the highest-value vertical to own, and we have an unfair head start on the data infrastructure. The idea is sound. The execution risk that matters is expert supply — everything else is buildable, and the demo shows the machine works.