On April 9, 2026, Google quietly shipped Gemini 3.1 Flash-Lite — a new efficiency-focused model priced at $0.25 per million input tokens, with 2.5x faster response times and 45% faster output generation than prior Gemini Flash versions. No big launch event. No keynote. Just a pricing page update and an API endpoint that started working the same afternoon.
It is now the cheapest commercial LLM API on the market, narrowly undercutting DeepSeek V3. And depending on your app, that number — $0.25 per million input tokens — might be the most important AI news of the week for you specifically.
The 5-Second Version
- Gemini 3.1 Flash-Lite shipped April 9, 2026 from Google DeepMind.
- $0.25 per million input tokens — narrowly under DeepSeek V3's $0.27, and roughly 40-70% cheaper than the comparable OpenAI and Anthropic small-model tiers.
- 2.5x faster response time and 45% faster output generation vs previous Gemini.
- Targets high-volume workloads: chat, classification, moderation, RAG, document extraction.
- Not for frontier reasoning tasks — use Gemini 2.5 Pro or Claude Opus for those.
- This is the new floor for commercial LLM pricing. Every other lab will have to respond.
Where Flash-Lite Sits in the Pricing Landscape
Let me put $0.25/M into the context of everything else you could pick from today. Input token prices for comparable-tier models as of April 13, 2026:
| Model | Provider | Input $/M tokens | Relative cost |
|---|---|---|---|
| Gemini 3.1 Flash-Lite | Google DeepMind | $0.25 | 1.0x (baseline) |
| DeepSeek V3 | DeepSeek | $0.27 | 1.1x |
| Llama 4 Maverick via Groq | Groq | $0.35 | 1.4x |
| Gemini 2.5 Flash | Google DeepMind | $0.35 | 1.4x |
| GPT-5.4 mini | OpenAI | $0.40 | 1.6x |
| Claude Haiku 4.6 | Anthropic | $0.80 | 3.2x |
| GPT-5.4 | OpenAI | $2.50 | 10x |
| Claude Sonnet 4.6 | Anthropic | $3.00 | 12x |
| Claude Opus 4.6 | Anthropic | $15.00 | 60x |
Two interesting things in that table. First: Gemini 3.1 Flash-Lite and DeepSeek V3 are now essentially tied for the cheapest commercial LLM at $0.25-$0.27 per million input tokens. Second: Claude Opus 4.6 is 60 times more expensive than Flash-Lite. The gap between "the frontier" and "the floor" has never been wider.
This is the shape of the market now. You have two bands: frontier models that cost $2-$15 per million tokens and are used for hard reasoning tasks, and high-volume models that cost $0.25-$0.40 per million tokens and are used for everything else. Knowing which band your workload belongs in is one of the most important product decisions you can make.
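To make the band decision concrete, here is a minimal cost sketch. The prices come straight from the table above; the workload volume (1M queries a day at ~500 input tokens each) is an illustrative assumption, not a benchmark.

```python
# Rough monthly input-token spend for a workload at a given per-million price.
# Prices are from the comparison table; volumes are illustrative.

def monthly_input_cost(queries_per_day: int, tokens_per_query: int,
                       price_per_million: float, days: int = 30) -> float:
    """Return the monthly input-token spend in dollars."""
    total_tokens = queries_per_day * tokens_per_query * days
    return total_tokens / 1_000_000 * price_per_million

# 1M queries/day at ~500 input tokens each:
flash_lite = monthly_input_cost(1_000_000, 500, 0.25)   # high-volume band
opus = monthly_input_cost(1_000_000, 500, 15.00)        # frontier band

print(f"Flash-Lite: ${flash_lite:,.0f}/mo")  # Flash-Lite: $3,750/mo
print(f"Opus:       ${opus:,.0f}/mo")        # Opus:       $225,000/mo
```

Same workload, two bands: $3,750 a month versus $225,000 a month. That 60x spread is why band selection is a product decision, not an implementation detail.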
What Flash-Lite Is Actually Good At
Flash-Lite is not the model you pick when you want the best reasoning in the world. It is the model you pick when you have a lot of queries to run and each one is bounded in complexity. Here is how to think about it:
Chat & Conversational UX
Customer support chat, free-tier chat UX, in-app AI assistants that answer simple questions about your product. Flash-Lite is fast enough that users don't notice latency, and cheap enough that you can run it on every free-tier user without bleeding money.
Classification & Routing
Classify user intent, route tickets, tag incoming content, flag spam. These are million-queries-per-day workloads where the cost of "call Claude Opus on every inbound message" is prohibitive. At $0.25/M, you can run classification on every inbound message and still have margin.
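The routing pattern itself is simple enough to sketch. The intent labels, keywords, and queue names below are illustrative, and the `classify()` call is stubbed with keyword matching — in production it would be a short Flash-Lite prompt with a constrained label set.

```python
# Route inbound messages based on a cheap classification step.
# classify() is stubbed with keywords so the routing logic is runnable;
# in production it would call Flash-Lite with a constrained label set.

INTENTS = ("billing", "bug_report", "spam", "other")

def classify(message: str) -> str:
    """Stand-in for an LLM call that returns one label from INTENTS."""
    text = message.lower()
    if "refund" in text or "invoice" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bug_report"
    if "free money" in text:
        return "spam"
    return "other"

def route(message: str) -> str:
    """Map the classified intent to a destination queue."""
    return {"billing": "finance-queue", "bug_report": "eng-queue",
            "spam": "drop", "other": "triage-queue"}[classify(message)]

print(route("My invoice is wrong"))       # finance-queue
print(route("The app crashes on login"))  # eng-queue
```

The point of the structure: the classification step is the thing you run millions of times a day, so it is the thing that belongs on the cheapest model that holds quality.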
Structured Extraction From Documents
Pull specific fields out of receipts, invoices, contracts, emails, or PDFs into JSON. Flash-Lite handles structured extraction reliably at a price point where processing 10,000 documents a day costs about $3-$5 instead of $50-$100.
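At this volume, the part worth getting right is validation: never trust model JSON blindly. A minimal sketch — the schema and the sample response are invented for illustration; in production `raw` would be the text returned by a Flash-Lite call with a JSON-output prompt.

```python
import json

# Validate a model's JSON extraction against the fields you asked for.
# The schema and sample response are illustrative.

REQUIRED_FIELDS = {"vendor": str, "total": float, "date": str}

def parse_extraction(raw: str) -> dict:
    """Parse one extracted record; raise on missing or mistyped fields."""
    record = json.loads(raw)
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise ValueError(f"bad type for field: {field}")
    return record

raw = '{"vendor": "Acme Corp", "total": 129.99, "date": "2026-04-01"}'
print(parse_extraction(raw)["total"])  # 129.99
```

Records that fail validation go to a retry or a human-review queue; at $0.25/M, a re-prompt on the occasional bad record costs close to nothing.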
Light-Weight RAG
Retrieval-augmented generation where the retrieval layer does most of the work and the LLM is just synthesizing the answer from retrieved chunks. Flash-Lite is fully capable of this and costs a fraction of what Claude Sonnet or GPT-5.4 would cost for the same workload.
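The shape of that workload, sketched below. The term-overlap scoring is a deliberate toy — a real system would use embeddings — but the prompt-assembly step, the part that goes to Flash-Lite, looks the same either way.

```python
# Minimal retrieval + prompt assembly for light-weight RAG.
# Scoring is plain term overlap (a toy stand-in for embeddings); the
# synthesis prompt sent to the model is the part that carries over.

def score(query: str, chunk: str) -> int:
    """Count shared words between query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Pick the top-k chunks and wrap them in a synthesis prompt."""
    top = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    context = "\n\n".join(top)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refund requests require the original order number.",
]
prompt = build_prompt("How long do refunds take?", docs)
print("Refunds are processed" in prompt)  # True
```

Because retrieval does the heavy lifting, the LLM's job is bounded: read a few hundred tokens of context, synthesize a short answer. That is exactly the complexity envelope Flash-Lite is built for.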
What Flash-Lite Is NOT Good At
Some honest balance, so nobody walks away with the wrong impression: Flash-Lite is a Flash-tier model. It is not built for:
Long-horizon agentic work. Multi-step planning, tool use, self-correction across several turns — the model will get you started but it will drift. Use Gemini 2.5 Pro, Claude Sonnet 4.6, or GPT-5.4 for agentic work with real autonomy requirements.
Complex reasoning tasks. If you have a problem that needs actual chain-of-thought, multi-hop inference, or careful adversarial analysis, Flash-Lite will give you a confident-sounding wrong answer some of the time. Use a frontier model.
Code generation beyond simple completions. Flash-Lite can write simple functions reliably but it is not where you go for complex refactoring, architecture decisions, or debugging hard problems. Use Claude Sonnet or Opus for code work.
Creative writing with high standards. Short form is fine. Long form with style, voice, and nuance is a frontier-model task.
The Bottom Line
If you are building anything with a free tier or a high-volume AI workload, the right move is to audit your current model usage this week. Anywhere you are currently spending on GPT-4o-mini, Claude Haiku, or Gemini 2.5 Flash, spin up a quick test against Flash-Lite. If the quality holds, your per-query cost just dropped by roughly 30-70% depending on what you were paying, and your free tier just became more generous without costing you more money.
This is exactly the kind of pricing-to-product decision we make in bootcamp. Which model do you pick for which task? How do you architect a chat app so that hard questions escalate to Sonnet and easy questions stay on Flash-Lite? How do you measure quality regressions when you swap models? None of this is theoretical — it is the actual day-to-day of shipping AI features in 2026.
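Measuring a quality regression on a model swap can be as simple as running the same labeled set through both models and comparing accuracy. In the sketch below, the gold labels and canned answers are invented stand-ins; in practice each answers dict would be filled by real API calls.

```python
# Sketch of a swap-eval: run one labeled set through both models, compare.
# Gold labels and canned answers are illustrative stand-ins for API output.

gold = {"q1": "billing", "q2": "bug", "q3": "spam"}
answers_sonnet = {"q1": "billing", "q2": "bug", "q3": "spam"}
answers_flash_lite = {"q1": "billing", "q2": "bug", "q3": "other"}

def accuracy(answers: dict, gold: dict) -> float:
    """Fraction of questions where the answer matches the gold label."""
    correct = sum(answers[q] == label for q, label in gold.items())
    return correct / len(gold)

delta = accuracy(answers_sonnet, gold) - accuracy(answers_flash_lite, gold)
print(f"quality drop from swap: {delta:.1%}")  # quality drop from swap: 33.3%
```

With a few hundred labeled examples, a harness like this tells you in minutes whether the cheaper model holds quality on your workload — and which query types need to escalate to the frontier band.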
Stop Reading AI News. Start Building With It.
The 2-day in-person Precision AI Academy bootcamp. 5 cities. $1,490. 40 seats max. Thursday-Friday cohorts, June-October 2026.
Reserve Your Seat