Gemini 3.1 Flash-Lite: AI at $0.25 Per Million Tokens

Google just shipped the cheapest frontier-quality LLM API in the world. 2.5x faster. $0.25 per million input tokens. Here's why it quietly changed what kinds of AI apps are economically possible.

- $0.25 per million input tokens
- 2.5x faster than prior Gemini Flash
- 45% faster output generation
- 60x cheaper than Claude Opus 4.6

On April 9, 2026, Google quietly shipped Gemini 3.1 Flash-Lite — a new efficiency-focused model priced at $0.25 per million input tokens, with 2.5x faster response times and 45% faster output generation than prior Gemini Flash versions. No big launch event. No keynote. Just a pricing page update and an API endpoint that started working the same afternoon.

It is, by a meaningful margin, the cheapest frontier-quality commercial LLM API that has ever existed. And depending on your app, that number — $0.25 per million input tokens — might be the most important AI news of the week for you specifically.


01 · Where Flash-Lite Sits in the Pricing Landscape

Let me put $0.25/M in context against everything else you could pick today. Input-token prices for comparable-tier models as of April 13, 2026:

| Model | Provider | Input $/M tokens | Relative cost |
|---|---|---|---|
| Gemini 3.1 Flash-Lite | Google DeepMind | $0.25 | 1.0x (baseline) |
| DeepSeek V3 | DeepSeek | $0.27 | 1.1x |
| Llama 4 Maverick (via Groq) | Groq | $0.35 | 1.4x |
| GPT-5.4 mini | OpenAI | $0.40 | 1.6x |
| Claude Haiku 4.6 | Anthropic | $0.80 | 3.2x |
| Gemini 2.5 Flash | Google DeepMind | $0.35 | 1.4x |
| GPT-5.4 | OpenAI | $2.50 | 10x |
| Claude Sonnet 4.6 | Anthropic | $3.00 | 12x |
| Claude Opus 4.6 | Anthropic | $15.00 | 60x |

Two things stand out in that table. First: Gemini 3.1 Flash-Lite and DeepSeek V3 are now essentially tied for the cheapest commercial LLM, at $0.25-$0.27 per million input tokens. Second: Claude Opus 4.6 is 60 times more expensive than Flash-Lite. The gap between "the frontier" and "the floor" has never been wider.

This is the shape of the market now. You have two bands: frontier models that cost $2-$15 per million tokens and are used for hard reasoning tasks, and high-volume models that cost $0.25-$0.40 per million tokens and are used for everything else. Knowing which band your workload belongs in is one of the most important product decisions you can make.
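To make the two bands concrete, here is a back-of-envelope monthly cost comparison using the input prices from the table above. The query volume and tokens-per-query figures are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope monthly input-token cost at the prices from the table above.
# 1M queries/day at ~500 input tokens each is an assumed workload, not a benchmark.

PRICE_PER_M = {            # USD per million input tokens (from the table)
    "gemini-3.1-flash-lite": 0.25,
    "gpt-5.4-mini": 0.40,
    "claude-sonnet-4.6": 3.00,
    "claude-opus-4.6": 15.00,
}

def monthly_cost(model: str, queries_per_day: int, tokens_per_query: int) -> float:
    """Input-token cost for a 30-day month, in USD."""
    tokens = queries_per_day * tokens_per_query * 30
    return tokens / 1_000_000 * PRICE_PER_M[model]

for model in PRICE_PER_M:
    print(f"{model:24s} ${monthly_cost(model, 1_000_000, 500):>12,.2f}/mo")
```

At that volume the two bands are roughly $3,750/mo on Flash-Lite versus $45,000/mo on Sonnet-class models, which is exactly why the band decision matters.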

02 · What Flash-Lite Is Actually Good At

Flash-Lite is not the model you pick when you want the best reasoning in the world. It is the model you pick when you have a lot of queries to run and each one is bounded in complexity. Here is how to think about it:

01 · Chat & Conversational UX

Customer support chat, free-tier chat UX, in-app AI assistants that answer simple questions about your product. Flash-Lite is fast enough that users don't notice latency, and cheap enough that you can run it on every free-tier user without bleeding money.

Ship chat on the free tier
02 · Classification & Routing

Classify user intent, route tickets, tag incoming content, flag spam. These are million-queries-per-day workloads where the cost of "call Claude Opus on every inbound message" is prohibitive. At $0.25/M, you can run classification on every inbound message and still have margin.

Replace your regex-and-heuristic rules
03 · Structured Extraction From Documents

Pull specific fields out of receipts, invoices, contracts, emails, or PDFs into JSON. Flash-Lite handles structured extraction reliably at a price point where processing 10,000 documents a day costs about $3-$5 instead of $50-$100.

Automate document pipelines cheaply
04 · Lightweight RAG

Retrieval-augmented generation where the retrieval layer does most of the work and the LLM is just synthesizing the answer from retrieved chunks. Flash-Lite is fully capable of this and costs a fraction of what Claude Sonnet or GPT-5.4 would cost for the same workload.

Good enough for 90% of RAG use cases
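The "$3-$5 for 10,000 documents a day" figure on the extraction card checks out under a plausible token-count assumption. A quick sanity check (the 1,200-2,000 tokens-per-document range is an assumption for a typical receipt or one-page invoice, not a measured figure):

```python
# Sanity check on the extraction card's claim: 10,000 docs/day for $3-$5.
# Tokens-per-document is an assumed range; measure it on your own corpus.

INPUT_PRICE_PER_M = 0.25   # Flash-Lite input price, USD per million tokens

def daily_extraction_cost(docs: int, tokens_per_doc: int) -> float:
    """Input-token cost of one day's document pipeline, in USD."""
    return docs * tokens_per_doc / 1_000_000 * INPUT_PRICE_PER_M

low = daily_extraction_cost(10_000, 1_200)
high = daily_extraction_cost(10_000, 2_000)
print(f"${low:.2f} - ${high:.2f} per day (input tokens only; output adds a little)")
```

That lands at $3.00-$5.00 per day on input tokens alone, matching the card; JSON output tokens add a small amount on top.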
03 · What Flash-Lite Is NOT Good At

Some honest balance, so nobody walks away with the wrong impression: Flash-Lite is a Flash-tier model. It is not built for:

Long-horizon agentic work. Multi-step planning, tool use, self-correction across several turns — the model will get you started but it will drift. Use Gemini 2.5 Pro, Claude Sonnet 4.6, or GPT-5.4 for agentic work with real autonomy requirements.

Complex reasoning tasks. If you have a problem that needs actual chain-of-thought, multi-hop inference, or careful adversarial analysis, Flash-Lite will give you a confident-sounding wrong answer some of the time. Use a frontier model.

Code generation beyond simple completions. Flash-Lite can write simple functions reliably but it is not where you go for complex refactoring, architecture decisions, or debugging hard problems. Use Claude Sonnet or Opus for code work.

Creative writing with high standards. Short form is fine. Long form with style, voice, and nuance is a frontier-model task.

04 · The Bottom Line

The Verdict
Gemini 3.1 Flash-Lite is the new floor for LLM pricing, and if your app makes a lot of AI calls per user, it just made a bunch of new free-tier features profitable overnight. Test it against your current workload this week.

If you are building anything with a free tier or a high-volume AI workload, the right move is to audit your current model usage this week. Anywhere you are currently spending on GPT-4o-mini, Claude Haiku, or Gemini 2.5 Flash, spin up a quick test against Flash-Lite. If the quality holds, your per-query cost just dropped by roughly 30-70% depending on which of those models you were on, and your free tier just became more generous without costing you more money.
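One lightweight way to run that audit: freeze a small set of real prompts with known-good answers, run both models against it, and compare scores before you swap. A minimal sketch; `call_model` is a hypothetical stand-in for your actual API client, and the eval prompts are invented examples:

```python
# Minimal quality-regression check for a model swap. `call_model` is a
# hypothetical placeholder; wire in your real API client (OpenAI, google-genai, ...).

from typing import Callable

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("replace with a real API call")

def score(call: Callable[[str, str], str], model: str,
          evals: list[tuple[str, str]]) -> float:
    """Fraction of eval prompts where the answer contains the expected string.
    Crude, but enough to catch large regressions when swapping models."""
    hits = sum(expected.lower() in call(model, prompt).lower()
               for prompt, expected in evals)
    return hits / len(evals)

EVALS = [  # invented examples; use real traffic from your own app
    ("What plan is the user on if their invoice shows PRO-ANNUAL?", "Pro"),
    ("Classify this message: 'cancel my account immediately'", "cancellation"),
]

# Swap is safe if the cheap model holds quality on YOUR prompts, e.g.:
# if score(call_model, "gemini-3.1-flash-lite", EVALS) >= \
#        score(call_model, "gemini-2.5-flash", EVALS) - 0.02:
#     print("swap approved")
```

Substring matching is deliberately crude; the point is a fixed, repeatable eval set, not a fancy scorer.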

This is exactly the kind of pricing-to-product decision we make in bootcamp. Which model do you pick for which task? How do you architect a chat app so that hard questions escalate to Sonnet and easy questions stay on Flash-Lite? How do you measure quality regressions when you swap models? None of this is theoretical — it is the actual day-to-day of shipping AI features in 2026.
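That escalate-the-hard-stuff architecture can be sketched in a few lines. The keyword heuristic below is purely illustrative (a production router would more often use a cheap classification call or embedding similarity); the model names come from the pricing table above:

```python
# Sketch of a two-tier router: easy queries stay on the cheap tier, hard ones
# escalate to a frontier model. The difficulty heuristic here is illustrative
# only; a real router would usually make a cheap classifier call instead.

CHEAP_MODEL = "gemini-3.1-flash-lite"
FRONTIER_MODEL = "claude-sonnet-4.6"

HARD_SIGNALS = ("why", "debug", "compare", "step by step", "prove", "refactor")

def pick_model(query: str) -> str:
    q = query.lower()
    looks_hard = len(q.split()) > 60 or any(s in q for s in HARD_SIGNALS)
    return FRONTIER_MODEL if looks_hard else CHEAP_MODEL

print(pick_model("What are your support hours?"))          # cheap tier
print(pick_model("Debug why my webhook retries forever"))  # frontier tier
```

The economics only work if the router itself is near-free, which is why a heuristic or a Flash-Lite classification call is the usual first pass.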

Stop Reading AI News. Start Building With It.

The 2-day in-person Precision AI Academy bootcamp. 5 cities. $1,490. 40 seats max. Thursday-Friday cohorts, June-October 2026.

Reserve Your Seat

Published By

Precision AI Academy

Practitioner-focused AI education · 2-day in-person bootcamp in 5 U.S. cities

Precision AI Academy publishes deep-dives on applied AI engineering for working professionals. Founded by Bo Peng (Kaggle Top 200) who leads the in-person bootcamp in Denver, NYC, Dallas, LA, and Chicago.

Kaggle Top 200 · Federal AI Practitioner · 5 U.S. Cities · Thu-Fri Cohorts