Benchmark — biopharma workflow competence

Every model in the panel is run on each environment's benchmark query, then scored 1–5 by an LLM judge calibrated to the same expert rubric that human graders use (judge: claude-opus-4-8). This is the eval product: how good each model is at each biopharma workflow.

Model	Vendor	Tier	Overall	Competitive Trial-	Epidemiology-Based	Patent / Freedom-t	Eligibility-Criter	Adverse-Event Codi	Target Validation	First-in-Human Sta	FDA Regulatory Str	Systematic Review	BD Asset Due Dilig
Claude Opus 4.8	anthropic	frontier	4.36	4.0	3.8	4.6	4.8	5.0	4.6	5.0	4.8	3.6	3.4
Claude Haiku 4.5	anthropic	small	3.62	3.0	2.6	3.8	4.4	3.8	4.0	5.0	3.6	3.0	3.0
GPT (frontier)	openai	frontier	3.58	3.2	3.6	1.4	4.8	4.4	4.2	5.0	3.2	3.2	2.8
GPT-4o mini	openai	small	2.70	1.0	2.4	2.0	4.0	2.8	4.2	3.4	3.8	1.2	2.2

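Cells are mean rubric scores (1–5); each Overall value is consistent with the unweighted mean of that model's ten workflow scores. Green ≥4, amber ≥3 and <4, red <3. Open "Compare models" to read the trajectories side by side.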
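
The legend's arithmetic can be checked with a short sketch. It assumes that Overall is the unweighted mean of the ten workflow scores, which matches the table values but is not stated by the source, and it applies the color thresholds as written. The names SCORES, overall, and color_band are hypothetical and are not taken from the dashboard's code.

```python
from statistics import mean

# Per-workflow scores copied from the table, in column order.
SCORES = {
    "Claude Opus 4.8": [4.0, 3.8, 4.6, 4.8, 5.0, 4.6, 5.0, 4.8, 3.6, 3.4],
    "Claude Haiku 4.5": [3.0, 2.6, 3.8, 4.4, 3.8, 4.0, 5.0, 3.6, 3.0, 3.0],
    "GPT (frontier)": [3.2, 3.6, 1.4, 4.8, 4.4, 4.2, 5.0, 3.2, 3.2, 2.8],
    "GPT-4o mini": [1.0, 2.4, 2.0, 4.0, 2.8, 4.2, 3.4, 3.8, 1.2, 2.2],
}


def overall(scores: list[float]) -> float:
    """Unweighted mean across workflows, rounded to two decimals (assumption)."""
    return round(mean(scores), 2)


def color_band(score: float) -> str:
    """Map a mean rubric score to the legend's color bands."""
    if score >= 4:
        return "green"
    if score >= 3:
        return "amber"
    return "red"


if __name__ == "__main__":
    for model, scores in SCORES.items():
        value = overall(scores)
        print(f"{model}\t{value:.2f}\t{color_band(value)}")
```

Under that assumption the computed means reproduce the Overall column for all four models.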