Every model in the panel is run on each environment's benchmark query, then scored 1–5 by an
LLM judge calibrated to the same expert rubric a human grader uses (judge: claude-opus-4-8).
This is the eval product: how well each model performs on each biopharma workflow.
| Model | Provider | Tier | Overall | Competitive Trial- | Epidemiology-Based | Patent / Freedom-t | Eligibility-Criter | Adverse-Event Codi | Target Validation | First-in-Human Sta | FDA Regulatory Str | Systematic Review | BD Asset Due Dilig |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 | anthropic | frontier | 4.36 | 4.0 | 3.8 | 4.6 | 4.8 | 5.0 | 4.6 | 5.0 | 4.8 | 3.6 | 3.4 |
| Claude Haiku 4.5 | anthropic | small | 3.62 | 3.0 | 2.6 | 3.8 | 4.4 | 3.8 | 4.0 | 5.0 | 3.6 | 3.0 | 3.0 |
| GPT (frontier) | openai | frontier | 3.58 | 3.2 | 3.6 | 1.4 | 4.8 | 4.4 | 4.2 | 5.0 | 3.2 | 3.2 | 2.8 |
| GPT-4o mini | openai | small | 2.70 | 1.0 | 2.4 | 2.0 | 4.0 | 2.8 | 4.2 | 3.4 | 3.8 | 1.2 | 2.2 |
Each cell is a mean rubric score (1–5). Green: ≥4; amber: ≥3 and <4; red: <3. Open Compare models to read the trajectories side by side.
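
The colour legend and the Overall column can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's actual code: the function names `band` and `overall` are hypothetical, and treating Overall as the unweighted mean of the ten workflow columns is an assumption. That assumption is consistent with all four rows of the table, but the page does not state it.

```python
from statistics import mean

# Per-workflow scores copied from the table rows above.
opus_scores = [4.0, 3.8, 4.6, 4.8, 5.0, 4.6, 5.0, 4.8, 3.6, 3.4]
gpt4o_mini_scores = [1.0, 2.4, 2.0, 4.0, 2.8, 4.2, 3.4, 3.8, 1.2, 2.2]


def band(score: float) -> str:
    """Map a mean rubric score (1-5) to the legend colour."""
    if score >= 4:
        return "green"
    if score >= 3:
        return "amber"
    return "red"


def overall(scores: list[float]) -> float:
    """Overall score for one model.

    Assumption: the unweighted mean of the workflow columns,
    rounded to two decimals as displayed in the table.
    """
    return round(mean(scores), 2)


if __name__ == "__main__":
    for name, scores in [("Claude Opus 4.8", opus_scores),
                         ("GPT-4o mini", gpt4o_mini_scores)]:
        total = overall(scores)
        print(f"{name}: overall={total:.2f} ({band(total)})")
```

Under this reading, a model can show a green Overall while individual workflows sit in amber, so the per-workflow cells remain the more informative view.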