21 frontier models ranked on MMLU, HumanEval, GPQA, and MATH, including Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, and DeepSeek V4.
| # | Model | MMLU % | HumanEval % | GPQA % | MATH % | Context | $/1M tokens (in/out) | Our Pick |
|---|---|---|---|---|---|---|---|---|
Benchmark sources: MMLU (Massive Multitask Language Understanding) is from Hendrycks et al. 2021, evaluated 5-shot unless noted. HumanEval pass@1 is from Chen et al. 2021. GPQA Diamond is from Rein et al. 2023. MATH is from Hendrycks et al. 2021 (Lightman/Minerva split). All scores are taken from official technical reports, provider-published evals, or independent replication studies available as of May 2026.
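For readers unfamiliar with the HumanEval metric, here is a minimal sketch of the unbiased pass@k estimator described in Chen et al. 2021: with n samples generated per problem and c of them passing the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names and the sample counts below are illustrative assumptions, and individual providers may compute their reported pass@1 differently (for example from a single greedy sample).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples generated, samples passing).
example_results = [(10, 3), (10, 10), (10, 0)]
example_score = benchmark_pass_at_k(example_results, k=1)
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, so a HumanEval pass@1 column is effectively the share of problems solved on a single attempt.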
May 2026 update: Added the following models.
- Claude Opus 4.7 (released Apr 16, 2026): 1M context, 94.2% GPQA Diamond, 87.6% SWE-bench Verified.
- GPT-5.5 (released Apr 23, 2026): first fully retrained OpenAI base model since GPT-4.5, 1M API context, 88.7% SWE-bench Verified.
- Gemini 3.1 Pro (released Feb 19, 2026): leads GPQA Diamond at 94.3%, 80.6% SWE-bench, 7.5× cheaper than Claude Opus on input.
- DeepSeek V4-Pro and V4-Flash (released Apr 24, 2026): V4-Pro 1.6T params/49B activated MoE, 1M context; V4-Flash 107× cheaper than GPT-5.5 on output.

MMLU is approaching saturation (88-94% for top models) and no longer cleanly differentiates frontier capability; for that, see GPQA Diamond and SWE-bench Verified.
Honest caveat: Benchmarks diverge substantially from real-world task performance. A model with a higher MMLU score may underperform a lower-ranked model on your specific use case. Coding benchmarks (HumanEval) use Python-only problems and may not reflect multilingual or systems programming capability. GPQA measures graduate-level science reasoning — not general-purpose intelligence. Costs shown are list prices at time of publication; enterprise contracts vary significantly.
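To turn the per-1M-token list prices into a cost for your own workload, a rough sketch follows. The `Pricing` class, the function names, and every price and token count are hypothetical placeholders, not values from the table. The example also shows why an "N× cheaper on input" figure does not carry over to total cost: the result depends on your input/output token mix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List price in USD per 1M tokens (placeholder values only)."""
    input_per_million: float
    output_per_million: float


def job_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a job with the given token counts."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def times_cheaper(expensive_cost: float, cheap_cost: float) -> float:
    """How many times cheaper the second cost is than the first."""
    return expensive_cost / cheap_cost


# Hypothetical models; these are NOT real prices.
model_a = Pricing(input_per_million=10.0, output_per_million=30.0)
model_b = Pricing(input_per_million=2.0, output_per_million=6.0)   # 5x cheaper on both
model_c = Pricing(input_per_million=2.0, output_per_million=30.0)  # 5x cheaper on input only

# Hypothetical workload: 2M input tokens, 500k output tokens.
cost_a = job_cost(model_a, 2_000_000, 500_000)  # 35.0
cost_b = job_cost(model_b, 2_000_000, 500_000)  # 7.0
cost_c = job_cost(model_c, 2_000_000, 500_000)  # 19.0

ratio_b = times_cheaper(cost_a, cost_b)  # 5.0
ratio_c = times_cheaper(cost_a, cost_c)  # about 1.84, despite 5x cheaper input
```

Plugging in your own token counts and negotiated rates gives a more useful comparison than the headline multipliers.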
Open-source models: Scores shown are from official Meta/Mistral/Google/DeepSeek/Alibaba technical reports with standard prompting. Scores may vary based on quantization, inference infrastructure, and prompt format.