Updated May 2026

LLM Leaderboard 2026

21 frontier models ranked on MMLU, HumanEval, GPQA, and MATH, including Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, and DeepSeek V4.

Last updated: May 3, 2026
# | Model | MMLU % | HumanEval % | GPQA % | MATH % | Context | $/1M tokens (in/out) | Our Pick

Methodology & Disclaimer

Benchmark sources: MMLU (Massive Multitask Language Understanding; Hendrycks et al. 2021), evaluated 5-shot unless noted. HumanEval pass@1 (Chen et al. 2021). GPQA Diamond (Rein et al. 2023). MATH (Hendrycks et al. 2021; Lightman/Minerva split). All scores are taken from official technical reports and public evals published by model providers, or from independent replication studies, as available in May 2026.
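For readers unfamiliar with pass@1: Chen et al. 2021 define an unbiased pass@k estimator. For each problem, generate n samples, count the c that pass the unit tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k), then average over problems. The sketch below illustrates that formula only; the function names and the per-problem counts are made up for illustration and do not come from any provider's evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (Chen et al. 2021).

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; `results` holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples drawn, samples passing).
example_results = [(10, 10), (10, 5), (10, 0), (10, 1)]
score = benchmark_pass_at_k(example_results, k=1)
```

With k = 1 the estimator reduces to c / n per problem, so a reported pass@1 is simply the average fraction of first-attempt solutions that pass.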

May 2026 update: Added Claude Opus 4.7 (released Apr 16, 2026; 1M context, 94.2% GPQA Diamond, 87.6% SWE-bench Verified); GPT-5.5 (released Apr 23, 2026; first fully retrained OpenAI base model since GPT-4.5, 1M API context, 88.7% SWE-bench Verified); Gemini 3.1 Pro (released Feb 19, 2026; leads GPQA Diamond at 94.3%, 80.6% SWE-bench, 7.5× cheaper than Claude Opus on input); and DeepSeek V4-Pro and V4-Flash (released Apr 24, 2026; V4-Pro is a 1.6T-parameter MoE with 49B activated parameters and 1M context, V4-Flash is 107× cheaper than GPT-5.5 on output). MMLU is approaching saturation (88-94% for top models) and no longer cleanly differentiates frontier capability; for that, see GPQA Diamond and SWE-bench Verified.
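The "N× cheaper" figures in the update note are ratios of list prices per 1M tokens, quoted separately for input and output. What a request actually costs depends on both prices and on the token mix. The sketch below shows that arithmetic; the `Pricing` class, the helper names, and every price in it are placeholders chosen for illustration, not the real rates of any model on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List price in USD per 1M tokens (placeholder values used below)."""
    usd_per_m_input: float
    usd_per_m_output: float


def request_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at list price."""
    return (input_tokens * p.usd_per_m_input
            + output_tokens * p.usd_per_m_output) / 1_000_000


def times_cheaper(expensive: float, cheap: float) -> float:
    """How many times cheaper `cheap` is than `expensive`."""
    if cheap <= 0:
        raise ValueError("cheap price must be positive")
    return expensive / cheap


# Placeholder prices, NOT the real rates of any listed model.
model_a = Pricing(usd_per_m_input=15.0, usd_per_m_output=60.0)
model_b = Pricing(usd_per_m_input=2.0, usd_per_m_output=6.0)

input_ratio = times_cheaper(model_a.usd_per_m_input, model_b.usd_per_m_input)
output_ratio = times_cheaper(model_a.usd_per_m_output, model_b.usd_per_m_output)

# A prompt-heavy request: 20k tokens in, 1k tokens out.
cost_a = request_cost(model_a, input_tokens=20_000, output_tokens=1_000)
cost_b = request_cost(model_b, input_tokens=20_000, output_tokens=1_000)
blended_ratio = times_cheaper(cost_a, cost_b)
```

With these placeholder numbers the input ratio is 7.5 and the output ratio is 10, while the blended ratio for the prompt-heavy request falls between the two. An input-only or output-only multiple therefore does not by itself give the saving on a mixed workload.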

Honest caveat: Benchmarks diverge substantially from real-world task performance. A model with a higher MMLU score may underperform a lower-ranked model on your specific use case. Coding benchmarks (HumanEval) use Python-only problems and may not reflect multilingual or systems programming capability. GPQA measures graduate-level science reasoning — not general-purpose intelligence. Costs shown are list prices at time of publication; enterprise contracts vary significantly.
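One way to act on this caveat is to score candidate models on a small set of examples drawn from your own workload. The sketch below assumes each model is wrapped as a plain Python callable from prompt to answer. The stub models, the test cases, and the exact-match scoring rule are all hypothetical stand-ins; a real comparison would call each provider's API and would probably need a looser scoring rule.

```python
from typing import Callable

Model = Callable[[str], str]


def accuracy(model: Model, cases: list[tuple[str, str]]) -> float:
    """Fraction of cases answered correctly, using exact match after
    trimming whitespace and lowercasing."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in cases
    )
    return hits / len(cases)


def rank_models(models: dict[str, Model],
                cases: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Return (name, accuracy) pairs, best first."""
    scored = [(name, accuracy(fn, cases)) for name, fn in models.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical stand-ins for real API calls.
def stub_model_a(prompt: str) -> str:
    return {"2+2?": "4", "Capital of France?": "Paris"}.get(prompt, "unknown")


def stub_model_b(prompt: str) -> str:
    return {"2+2?": "4"}.get(prompt, "unknown")


cases = [
    ("2+2?", "4"),
    ("Capital of France?", "paris"),
    ("Largest planet?", "Jupiter"),
]
ranking = rank_models({"model-a": stub_model_a, "model-b": stub_model_b}, cases)
```

Even a few dozen representative cases can reveal a ranking that differs from the leaderboard order, although a set that small gives only a rough signal.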

Open-source models: Scores shown are from official Meta/Mistral/Google/DeepSeek/Alibaba technical reports with standard prompting. Scores may vary based on quantization, inference infrastructure, and prompt format.

Sources: MMLU (Hendrycks et al. 2021); HumanEval (Chen et al. 2021); GPQA (Rein et al. 2023); MATH (Hendrycks et al. 2021). Costs: official API pricing pages.

Learn to pick the right model for your use case.

At the Precision AI Academy bootcamp you'll work hands-on with Claude, GPT, Gemini, and open-source models — comparing them on real tasks, not just benchmarks.
