20 frontier models ranked on MMLU, HumanEval, GPQA, and MATH.
| # | Model | MMLU % | HumanEval % | GPQA % | MATH % | Context window | $ per 1M tokens (in/out) | Our Pick |
|---|---|---|---|---|---|---|---|---|
Benchmark sources: MMLU (Massive Multitask Language Understanding) scores from Hendrycks et al. 2021, evaluated 5-shot unless noted. HumanEval pass@1 from Chen et al. 2021. GPQA Diamond from Rein et al. 2023. MATH from Hendrycks et al. 2021 (Lightman/Minerva split). All scores are drawn from official technical reports and public evals as of April 2026, published by the model providers or reproduced in independent replication studies.
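For readers who want to check the coding column themselves: pass@1 in the table follows the unbiased pass@k estimator published in Chen et al. 2021. The sketch below mirrors the estimator from that paper's appendix; the variable names are ours.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. 2021.

    n: total completions sampled per problem
    c: completions that passed the unit tests
    k: the k in pass@k (k=1 for the table above)
    """
    if n - c < k:
        return 1.0  # too few failures for any k-subset to miss
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For k=1 this reduces exactly to c/n, the fraction of sampled completions that pass, which is why pass@1 is often described simply as sampled accuracy.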
Honest caveat: Benchmarks diverge substantially from real-world task performance. A model with a higher MMLU score may underperform a lower-ranked model on your specific use case. HumanEval is Python-only, so coding scores may not reflect capability in other programming languages or in systems-level work. GPQA measures graduate-level science reasoning, not general-purpose intelligence. Costs shown are list prices at the time of publication; enterprise contracts vary significantly.
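To turn the price column into a per-request figure, the arithmetic is just token counts times the in/out rates. The sketch below uses hypothetical prices ($3 in / $15 out per 1M tokens) purely for illustration, not numbers from the table.

```python
def request_cost_usd(prompt_tokens: int, completion_tokens: int,
                     usd_per_m_in: float, usd_per_m_out: float) -> float:
    """List-price cost of one request, given $/1M-token in/out rates
    as in the table's price column."""
    return (prompt_tokens * usd_per_m_in
            + completion_tokens * usd_per_m_out) / 1_000_000

# Hypothetical rates for illustration only: $3.00 in, $15.00 out per 1M tokens.
print(f"${request_cost_usd(12_000, 1_000, 3.00, 15.00):.4f}")  # -> $0.0510
```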
Open-source models: Scores shown are from official Meta/Mistral/Google/DeepSeek technical reports with standard prompting; results may vary with quantization, inference infrastructure, and prompt format.
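Prompt format is one of those variables. As a point of reference, here is a minimal sketch of the conventional Hendrycks-harness 5-shot MMLU layout; the dict field names ('question', 'choices', 'answer') are illustrative assumptions, not a fixed dataset schema.

```python
def format_mmlu_prompt(subject: str, shots: list[dict], question: dict) -> str:
    """Builds a few-shot MMLU prompt in the conventional Hendrycks-harness
    layout: subject header, then question / A-D choices / 'Answer: <letter>'
    demos, then the test question ending in a bare 'Answer:' cue."""
    letters = "ABCD"

    def block(item: dict, with_answer: bool) -> str:
        lines = [item["question"]]
        lines += [f"{letters[i]}. {c}" for i, c in enumerate(item["choices"])]
        answer = f" {letters[item['answer']]}" if with_answer else ""
        lines.append(f"Answer:{answer}")
        return "\n".join(lines)

    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.\n\n")
    demos = "\n\n".join(block(s, True) for s in shots)
    return header + demos + "\n\n" + block(question, False)
```

Small deviations from this layout, such as relabeled choices or a moved 'Answer:' cue, can be enough to shift measured scores for some models, which is part of why open-weight numbers replicate inconsistently across harnesses.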