Google just released the most permissive, most capable open AI model family it has ever built. Gemma 4 shipped on April 2, 2026 under the Apache 2.0 license — four variants, natively multimodal across every size, with the smaller ones engineered to run completely offline on edge hardware. You can download the weights, fine-tune them, ship them in a product you sell, and you owe Google nothing but attribution.
For most of the last eighteen months, open-weight releases have been technically impressive but commercially awkward. Llama's licenses came with use restrictions. Mistral's strongest models went closed. Qwen was great but from a company many enterprises couldn't buy from. Gemma 4 is the first release that is simultaneously frontier-adjacent in quality, commercially unrestricted, and genuinely deployable on the edge.
The 5-Second Version
- Four sizes: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts, and 31B Dense.
- Apache 2.0 license: the gold standard for commercial use. No use restrictions, no rug-pulls.
- Natively multimodal across all sizes: text, images, video, OCR. Smaller variants add audio.
- 256K context windows — enough for an entire medium codebase or a day of meeting transcripts.
- 31B ranks #3 open model on the Arena leaderboard. 26B ranks #6.
- E2B and E4B run offline on phones, laptops, Raspberry Pi, and Jetson Orin Nano.
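To sanity-check the 256K-context bullet above against your own material, a rough estimate is usually enough. The sketch below assumes about 4 characters per token, a common rule of thumb for English text and not a measurement of Gemma 4's tokenizer. The function names and the reserved-output budget are hypothetical, chosen only for illustration.

```python
CONTEXT_TOKENS = 256_000   # "256K" taken literally; an assumption about the exact limit
CHARS_PER_TOKEN = 4        # rule-of-thumb assumption, not measured on Gemma 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(texts, reserved_for_output: int = 2_000) -> bool:
    """True if all texts plus an output budget fit in the assumed window."""
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserved_for_output <= CONTEXT_TOKENS
```

For a real decision, count tokens with the model's actual tokenizer; this heuristic can be off by a wide margin for code or non-English text.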
The Four Sizes
Gemma 4 isn't one model — it's a family designed around deployment tier, not research prestige. Each variant targets a specific hardware class and use case.
Every variant is multimodal by default. Text, images, video, and OCR are all handled natively, without separate modality adapters. The E2B and E4B variants additionally support audio input for speech recognition, which is a first for an open model at that parameter count.
Why Gemma 4 Actually Matters
Two things separate Gemma 4 from the parade of open-weight releases we've seen over the past eighteen months. The license, and the offline story.
Technically Open, Practically Awkward
Use restrictions on commercial deployment. License that can change. Won't run offline without quantization tricks that tank quality. Great for research, hard to put into a product you sell.
Apache 2.0, Offline-First
Full commercial use, attribution only. License locked in permanently. E2B and E4B engineered for edge deployment with near-zero latency on a phone, Pi, or Jetson. First credible offline-first frontier family.
On raw quality, Google says the 31B model outcompetes models 20× its size on a set of reasoning and agentic benchmarks. That is Google's framing. The claim to pay attention to is the Arena ranking, where the 31B currently sits as the #3 open model in the world. It won't out-reason GPT-5.4 or Claude Opus 4.6 on the hardest questions. But for the vast majority of production AI workloads (document Q&A, structured extraction, classification, agent tool use, coding assistance) it is fully sufficient.
Who Should Use It
Enterprise with Data Residency
If your data can't leave your VPC, AWS region, or government cloud, Gemma 4 is now a first-class option. Deploy the 31B on a GPU instance, point your RAG pipeline at it, and get frontier-adjacent quality without sending a token to OpenAI or Anthropic.
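A back-of-the-envelope check helps when sizing that GPU instance. The sketch below counts memory for the weights only (no KV cache, activations, or runtime overhead), and the precision options are generic assumptions, not a list of formats Google ships for Gemma 4.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# Illustrative precisions; which ones are practical for Gemma 4 is an assumption.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"31B at {label}: ~{weight_memory_gb(31, bits):.1f} GB of weights")
```

At 16-bit precision that works out to about 62 GB for the 31B weights, so plan for more than that once context and batching are added.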
On-Device Product Teams
Building AI into a phone app, a medical device, a robot, or an embedded system? The E2B and E4B variants are the first open models that make offline inference genuinely practical. Audio support means voice interfaces that work on a plane.
Developers Learning AI Engineering
Running Gemma 4 locally is a better teacher than hitting a paid API. You see the tokenizer, tweak the sampler, profile the inference, and watch what happens when you change prompts. No budget anxiety. No rate limits. No black box.
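If "tweak the sampler" sounds abstract, here is what the knobs do, shown on made-up numbers. This is a minimal sketch of temperature and top-k sampling in plain Python, not the sampler any particular runtime ships. The logits and the `sample_token` helper are hypothetical.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token index from raw logits using temperature and optional top-k."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: take the highest score.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Rank candidates by score and keep only the top k, if requested.
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]
    # Softmax weights over the scaled logits (subtract the max for stability).
    scaled = [logits[i] / temperature for i in indices]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
print(sample_token(toy_logits, temperature=0))                      # greedy: index 0
print(sample_token(toy_logits, temperature=0.7, top_k=2, rng=rng))  # index 0 or 1
```

Lower temperatures concentrate probability on the top-scoring token, and top-k removes the long tail entirely. Watching outputs change as you move those two values is the kind of feedback a hosted API rarely gives you.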
Regulated Industries
HIPAA, FedRAMP, and SOC 2 compliance get simpler when the model runs on infrastructure you already control. Gemma 4 doesn't solve compliance, but it removes the biggest blocker: sending protected data to a third-party API.
What It Still Can't Do
Let me be direct. Gemma 4 is not going to replace Claude or GPT for the most demanding tasks. The closed frontier models still have meaningfully better long-horizon reasoning, better tool use, and better calibration on rare or adversarial questions.
If you're building something that needs the absolute best reasoning available — multi-step agent planning over a large codebase, adversarial security auditing, high-stakes medical decision support — Gemma 4 is not the right call. Pay for the closed model. The hours you'd spend compensating for the quality gap are worth more than the API bill.
How to Start Using It Today
The fastest path is Hugging Face. The weights went live on the Hub at launch. For on-device experimentation, LM Studio and Ollama both support Gemma 4 — download the app, pick the variant, run inference on your laptop in under five minutes.
from transformers import pipeline

# Placeholder input; replace with the contract you want summarized
contract_text = "..."

# Pull the 4B model (runs on a MacBook Air or Pi 5)
gen = pipeline(
    "text-generation",
    model="google/gemma-4-4b-it",
    device_map="auto",
)

prompt = "Summarize this contract in 3 bullet points:\n\n" + contract_text
result = gen(prompt, max_new_tokens=400)
print(result[0]["generated_text"])
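For the Ollama route mentioned above, a local server can be called over plain HTTP. The sketch below targets Ollama's default local endpoint using only the standard library. The model tag `gemma4:e4b` is a hypothetical placeholder, so check `ollama list` for the name your install actually uses.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:e4b"  # hypothetical tag; substitute the one `ollama list` shows


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Usage, once the server is running and the model is pulled:
#   print(generate("Explain the Apache 2.0 license in one sentence."))
```

Because nothing leaves localhost, this is also a cheap way to prototype the offline behavior before committing to an on-device runtime.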
For production deployment on Google Cloud, Vertex AI has a one-click Gemma 4 deployment path. For serverless inference without managing GPUs, Groq and Together AI both added Gemma 4 endpoints on launch day.
The Bottom Line
If you're learning AI engineering in 2026, this is the release that makes "run it locally and actually understand what's happening" a real option for the first time. Go download the E4B, point it at your own data, and see what falls out.
Apache 2.0 plus edge inference changes who can actually ship AI products.
The licensing is the underreported story here. Apache 2.0 is not just "free to use" — it means a startup can build a commercial product on Gemma 4, ship it to customers, and never owe Google a cent or a compliance conversation. Combined with on-device inference, this effectively removes two of the three major barriers to AI deployment in regulated and offline environments: cost and connectivity. The third barrier — data privacy compliance — also improves because the data never leaves the device. That's a meaningful shift for healthcare, defense-adjacent, and industrial IoT use cases.
Our reading is that Gemma 4 will accelerate a class of applications that API-based models can't serve: embedded devices where latency matters more than peak accuracy, air-gapped government networks, consumer apps that can't afford $0.01-per-query inference costs at scale. The Effective 2B variant running on a Raspberry Pi 5 at roughly 15 tokens per second is slow by server standards but entirely adequate for a document classification pipeline or a local voice assistant. Llama 3.2 3B was the previous benchmark here; Gemma 4's multimodal support by default is a real step up.
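The two numbers in that paragraph are easy to turn into a quick feasibility check. The sketch below uses the roughly 15 tokens per second and $0.01 per query quoted above. The 150-token answer length and 100,000 queries per day are illustrative assumptions, and the timing covers generation only, not prompt processing.

```python
TOKENS_PER_SECOND = 15      # approximate Raspberry Pi 5 figure quoted above
COST_PER_QUERY_USD = 0.01   # API price point quoted above


def generation_seconds(output_tokens: int) -> float:
    """Time to generate the output at the quoted decode speed."""
    return output_tokens / TOKENS_PER_SECOND


def monthly_api_cost(queries_per_day: int, days: int = 30) -> float:
    """API spend for a month at the quoted per-query price."""
    return queries_per_day * days * COST_PER_QUERY_USD


print(f"{generation_seconds(150):.0f} s for a 150-token answer")            # 10 s
print(f"${monthly_api_cost(100_000):,.0f} per month at 100k queries/day")   # $30,000
```

Ten seconds per answer is fine for background classification and marginal for conversation, which is the trade the paragraph describes.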
If you're building anything that needs to process images, text, or audio without a cloud dependency, this is the first genuinely practical open model family for that use case. The 256K context window on the larger variants is especially useful for document-heavy workflows. Don't wait for a "better" open model — this one is good enough to ship.