Frontier vs. weaker models
Same query, same tools, different models — judged on the expert rubric. This is what an environment is for: it discriminates model quality. Pick a workflow.
Claude Opus 4.8: 5.0 GPT (frontier): 4.4 Claude Haiku 4.5: 3.8 GPT-4o mini: 2.8
Claude Opus 4.8: 3.4 Claude Haiku 4.5: 3.0 GPT (frontier): 2.8 GPT-4o mini: 2.2
Claude Opus 4.8: 4.0 GPT (frontier): 3.2 Claude Haiku 4.5: 3.0 GPT-4o mini: 1.0
Claude Opus 4.8: 4.8 GPT (frontier): 4.8 Claude Haiku 4.5: 4.4 GPT-4o mini: 4.0
Claude Haiku 4.5: 5.0 Claude Opus 4.8: 5.0 GPT (frontier): 5.0 GPT-4o mini: 3.4
Claude Opus 4.8: 4.6 Claude Haiku 4.5: 3.8 GPT-4o mini: 2.0 GPT (frontier): 1.4
Claude Opus 4.8: 3.8 GPT (frontier): 3.6 Claude Haiku 4.5: 2.6 GPT-4o mini: 2.4
Claude Opus 4.8: 4.8 GPT-4o mini: 3.8 Claude Haiku 4.5: 3.6 GPT (frontier): 3.2
Claude Opus 4.8: 3.6 GPT (frontier): 3.2 Claude Haiku 4.5: 3.0 GPT-4o mini: 1.2
Claude Opus 4.8: 4.6 GPT-4o mini: 4.2 GPT (frontier): 4.2 Claude Haiku 4.5: 4.0