Benchmark — biopharma workflow competence

Every model in the panel is run on each environment's benchmark query, then scored 1–5 by an LLM judge (claude-opus-4-8) calibrated to the same expert rubric a human grader uses. This is the eval product: how good each model is at each biopharma workflow.

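| Model | Overall | Competitive Trial- | Epidemiology-Based | Patent / Freedom-t | Eligibility-Criter | Adverse-Event Codi | Target Validation | First-in-Human Sta | FDA Regulatory Str | Systematic Review | BD Asset Due Dilig |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 (anthropic frontier) | 4.36 | 4.0 | 3.8 | 4.6 | 4.8 | 5.0 | 4.6 | 5.0 | 4.8 | 3.6 | 3.4 |
| Claude Haiku 4.5 (anthropic small) | 3.62 | 3.0 | 2.6 | 3.8 | 4.4 | 3.8 | 4.0 | 5.0 | 3.6 | 3.0 | 3.0 |
| GPT (frontier) (openai frontier) | 3.58 | 3.2 | 3.6 | 1.4 | 4.8 | 4.4 | 4.2 | 5.0 | 3.2 | 3.2 | 2.8 |
| GPT-4o mini (openai small) | 2.7 | 1.0 | 2.4 | 2.0 | 4.0 | 2.8 | 4.2 | 3.4 | 3.8 | 1.2 | 2.2 |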

Cells are mean rubric scores (1–5). Green ≥4, amber ≥3 and <4, red <3. Open "Compare models" to read the trajectories side by side.
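
The colour bands and the Overall column can be reproduced from the per-workflow cells. The sketch below is a minimal illustration, not the page's actual implementation: the function names `band` and `overall` are hypothetical, and treating Overall as the unweighted mean of the ten workflow scores (rounded to two decimals) is an assumption inferred from the numbers in the table, not something the page states.

```python
from statistics import mean

# Per-workflow scores copied from the table, in its column order
# (Overall column excluded).
SCORES = {
    "Claude Opus 4.8": [4.0, 3.8, 4.6, 4.8, 5.0, 4.6, 5.0, 4.8, 3.6, 3.4],
    "Claude Haiku 4.5": [3.0, 2.6, 3.8, 4.4, 3.8, 4.0, 5.0, 3.6, 3.0, 3.0],
    "GPT (frontier)": [3.2, 3.6, 1.4, 4.8, 4.4, 4.2, 5.0, 3.2, 3.2, 2.8],
    "GPT-4o mini": [1.0, 2.4, 2.0, 4.0, 2.8, 4.2, 3.4, 3.8, 1.2, 2.2],
}


def band(score: float) -> str:
    """Map a 1-5 score to the legend's colour band."""
    if score >= 4:
        return "green"
    if score >= 3:
        return "amber"
    return "red"


def overall(workflow_scores: list[float]) -> float:
    """Assumed aggregation: unweighted mean, rounded to two decimals."""
    return round(mean(workflow_scores), 2)


if __name__ == "__main__":
    for model, cells in SCORES.items():
        score = overall(cells)
        print(f"{model}: overall {score} ({band(score)})")
```

Under that assumption, the computed value for each of the four models matches its Overall cell, which suggests every workflow carries equal weight in the headline number.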