Model Context Windows
What actually fits in 1 million tokens? Compare every major AI model's context window — and paste your own text to see if it fits.
| Model | Provider | Context Tokens | ~Characters | What Fits |
|---|---|---|---|---|
| Gemini 2.5 Pro | Google DeepMind | 2,000,000 | ~8,000,000 | All Shakespeare + War & Peace + The Bible + more |
| Claude Opus / Sonnet 4.6 | Anthropic | 1,000,000 | ~4,000,000 | The Bible KJV, War & Peace, LOTR trilogy, all at once |
| Gemini 2.5 Flash | Google DeepMind | 1,000,000 | ~4,000,000 | Same as Claude 1M, at faster speed & lower cost |
| GPT-5 | OpenAI | 400,000 | ~1,600,000 | LOTR trilogy, Python stdlib docs, ~3 textbooks |
| Claude Haiku 4.5 | Anthropic | 200,000 | ~800,000 | Python docs, 2 textbooks, large codebase |
| GPT-4o | OpenAI | 128,000 | ~512,000 | A 300-page textbook or a small codebase |
| Llama 4 70B | Meta | 128,000 | ~512,000 | A 300-page textbook or a small codebase |
| DeepSeek V3 | DeepSeek | 64,000 | ~256,000 | ~2 research papers + a day's Slack + one 10-K |
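The table above uses a common rule of thumb of roughly 4 characters per token for English text. A minimal sketch of that estimate, checking a text against the context windows listed above (the ~4 chars/token ratio is a heuristic, not a real tokenizer count, and actual limits vary by API tier):

```python
# Rough sketch: estimate token count from character length and check
# which context windows (from the table above) it would fit.
# CHARS_PER_TOKEN = 4 is a heuristic for English prose, not exact.

CONTEXT_WINDOWS = {
    "Gemini 2.5 Pro": 2_000_000,
    "Claude Opus / Sonnet 4.6": 1_000_000,
    "Gemini 2.5 Flash": 1_000_000,
    "GPT-5": 400_000,
    "Claude Haiku 4.5": 200_000,
    "GPT-4o": 128_000,
    "Llama 4 70B": 128_000,
    "DeepSeek V3": 64_000,
}

CHARS_PER_TOKEN = 4  # common rule of thumb for English text

def estimate_tokens(text: str) -> int:
    """Approximate token count from raw character length."""
    return len(text) // CHARS_PER_TOKEN

def models_that_fit(text: str) -> list[str]:
    """Models whose context window can hold the estimated token count."""
    tokens = estimate_tokens(text)
    return [name for name, window in CONTEXT_WINDOWS.items()
            if tokens <= window]

# ~600K characters -> ~150K estimated tokens
sample = "x" * 600_000
print(estimate_tokens(sample))   # 150000
print(models_that_fit(sample))   # everything with a 200K+ window
```

For a real application you would swap the heuristic for an actual tokenizer, since character-per-token ratios differ across languages and across code vs. prose.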
Studies show models can degrade significantly on recall tasks once the relevant information is buried past ~60% of their context window — a phenomenon called the "lost-in-the-middle" problem. Stuffing 1M tokens with loosely related data often hurts accuracy compared to using Retrieval-Augmented Generation (RAG) to surface only the most relevant chunks. Use large context for structured tasks (full codebases, legal docs) — not as a substitute for good retrieval architecture.
A context window is the total amount of text — measured in tokens — that an AI model can "see" at once during a single conversation or task. Tokens are roughly 3–4 characters of English text each: short common words are often a single token, while longer or rarer words split into several.
Think of it like RAM. Everything inside the window is available for the model to reason about. Everything outside it is invisible — the model has no access to it unless you explicitly include it.
Transformers — the architecture behind GPT, Claude, and Gemini — use attention to relate every token to every other token in the window. Larger windows let you feed in entire codebases, legal contracts, or books and ask questions across all of it.
But attention computation scales quadratically with length, and models trained on short sequences may not generalize perfectly to very long ones. This is why 1M-token models require specialized training techniques, not just bigger hardware.
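The quadratic scaling mentioned above can be made concrete with a toy calculation: full attention relates every token to every other token, so the number of score pairs grows with the square of the window length. A small illustrative sketch (pair counts only, ignoring constant factors and optimizations like flash attention):

```python
# Illustrative only: full attention computes a score for every
# (query token, key token) pair, so cost grows as n^2.

def attention_pairs(n_tokens: int) -> int:
    """Number of pairwise attention scores for a window of n tokens."""
    return n_tokens * n_tokens

for n in (128_000, 400_000, 1_000_000, 2_000_000):
    print(f"{n:>9} tokens -> {attention_pairs(n):.2e} score pairs")

# Doubling the window quadruples the pairwise work:
ratio = attention_pairs(2_000_000) / attention_pairs(1_000_000)
print(ratio)  # 4.0
```

This is why going from 128K to 1M tokens is not a ~8x cost increase but closer to ~60x at the attention layer, motivating the specialized techniques the text refers to.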
Research from Stanford found that LLMs tend to recall information at the beginning and end of a long context well — but often miss facts buried in the middle. This means a 1M-token context is not equivalent to 1M tokens of perfect memory.
For tasks requiring precise recall across massive documents, Retrieval-Augmented Generation (RAG) — where a search index surfaces only the relevant chunks — often outperforms raw long-context prompting.
Use long context when the entire document matters — legal review, full-codebase refactors, summarizing a complete annual report, or analyzing a transcript from start to finish.
Use RAG when you have a large corpus (thousands of documents) and only need relevant snippets, or when cost and latency matter. Sending 100K tokens on every query is expensive — a good retrieval layer means you only send what's needed.
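The cost argument above can be sketched with back-of-the-envelope arithmetic. The per-token price below is hypothetical, purely for illustration; real pricing varies widely by model and provider:

```python
# Back-of-the-envelope sketch: full-document prompting vs. a retrieval
# layer that sends only ~2K relevant tokens per query.
# PRICE_PER_1K_TOKENS is a hypothetical input price, for illustration only.

PRICE_PER_1K_TOKENS = 0.003

def query_cost(prompt_tokens: int, queries: int) -> float:
    """Total input cost for `queries` requests of `prompt_tokens` each."""
    return prompt_tokens / 1000 * PRICE_PER_1K_TOKENS * queries

full_context = query_cost(100_000, queries=1_000)  # whole doc every time
rag = query_cost(2_000, queries=1_000)             # retrieved chunks only

print(full_context)  # 300.0
print(rag)           # 6.0
```

At these illustrative rates, a retrieval layer cuts input cost by 50x over a thousand queries, before even counting the latency savings of smaller prompts.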