Gemma 4 Is Out: Apache 2.0, 3.8B Active Params, and the Best Local Model in 2026
Google dropped Gemma 4 on April 2 with four variants, a 256K context window, and — finally — an Apache 2.0 license. The 26B MoE activates only 3.8B params at inference. Here's what changed, what it means for local AI, and how it stacks up.
Google's Gemma 3 launched over a year ago and stayed competitive longer than most expected. Now Gemma 4 is here, and it's a meaningful step up — not just in raw performance, but in one area that mattered more than any benchmark: the license.
Starting with Gemma 4, Google has dropped its custom Gemma Terms of Use and switched to Apache 2.0. That single change may matter more for enterprise adoption than every performance gain in the release. This shift aligns well with the broader trends in AI licensing that prioritize flexibility and community-driven development, as discussed in AI Agent Guardrails & Output Validation in 2026: Tools, Patterns & Best Practices.
Here's the full breakdown: what's in Gemma 4, why Apache 2.0 is such a big deal, and how to run it today.
What Is Gemma 4?
Gemma 4 is Google's latest family of open-weight large language models, released April 2, 2026. Built on the same underlying architecture as the closed Gemini 3 models, it comes in four variants optimized for different hardware tiers — from data center GPUs to smartphones. This multi-faceted approach mirrors the strategies outlined in Multi-Agent Orchestration: A Practical Guide for 2026, emphasizing adaptability and performance across various platforms.
The four models:
| Model | Params | Active at Inference | Context | Target Hardware |
|---|---|---|---|---|
| Gemma 4 31B Dense | 31B | 31B | 256K | 80GB H100 (unquantized) or consumer GPU (quantized) |
| Gemma 4 26B MoE | 26B | 3.8B | 256K | 80GB H100 (unquantized) or consumer GPU (quantized) |
| Gemma 4 E4B | ~4B effective | ~4B | 128K | Smartphones, Raspberry Pi, Jetson Nano |
| Gemma 4 E2B | ~2B effective | ~2B | 128K | Smartphones, low-power edge devices |
The standout number is the 26B MoE: 3.8 billion active parameters during inference. Despite 26 billion total parameters in the model weights, inference only touches a small fraction of them on any given forward pass — which means tokens-per-second far higher than a comparably sized dense model. This efficiency is crucial for deploying models like Gemma 4 in resource-constrained environments, as highlighted in MCP Is Not Dead: Why Server-Side MCP Changes Everything for AI Agents.
The Apache 2.0 Shift: Why It Actually Matters
Gemma 3 shipped with a custom Google license that caused real problems for enterprise developers:
- Prohibited-use policy that Google could update unilaterally: your legal compliance could change without any action on your part
- Required developers to enforce Google's rules across all downstream projects — if you built on Gemma and someone used your product for a prohibited use, you were potentially liable
- Synthetic data transfer clause — a reading of the license suggested that models trained on data generated by Gemma might be subject to Gemma's license terms, which killed use of Gemma for data augmentation pipelines
- No commercial clarity — legal teams at enterprises routinely flagged the custom license as too uncertain to build on
Apache 2.0 eliminates all of this. Apache 2.0 is one of the most widely understood open-source licenses in existence. It's permissive, commercially friendly, and — crucially — Google can't unilaterally change it. If you build a product on Gemma 4 under Apache 2.0, the terms are locked in.
This is the license change that was blocking significant enterprise adoption of Gemma. It's now resolved.
Performance: Where Gemma 4 Actually Lands
Google claims Gemma 4 31B Dense launches at #3 on the LMSYS Chatbot Arena open-model leaderboard, behind only GLM-5 and Kimi 2.5. Those two models are dramatically larger, so Elo score relative to parameter count is where Gemma 4 is genuinely exceptional.
Key capabilities:
- Reasoning and math: Improved over Gemma 3, built on Gemini 3's architecture
- Code generation: Google claims quality competitive with Gemini Pro and Claude Code, with the advantage of running entirely locally (offline, private, no API costs)
- Vision/OCR: Better at processing visual input — chart understanding, document OCR — compared to Gemma 3
- Structured output: Native JSON output support, important for agentic workflows
- Function calling: Native support for tool/function call formats — agents can use Gemma 4 as a local reasoning backbone
All four models support 140+ languages and include native structured JSON output, function calling, and common tool/API instruction formats.
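To make the agent use case concrete, here is a minimal Python sketch of the application side of function calling: parsing a JSON tool call emitted by a model and dispatching it to a local function. The JSON shape (`name` / `arguments`), the `get_weather` tool, and the `dispatch_tool_call` helper are hypothetical illustrations, not Gemma 4's actual tool-call format, which depends on the chat template and runtime you use.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would query a weather service."""
    return f"Weather lookup requested for {city}"

# Registry of tools the model is allowed to call (names are illustrative).
TOOLS = {"get_weather": get_weather}

def dispatch_tool_call(raw: str) -> str:
    """Parse a JSON tool call and run the matching local function.

    Assumes the model emits {"name": ..., "arguments": {...}}; this shape is
    an assumption for the sketch, not a documented Gemma 4 format.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        raise ValueError("'arguments' must be a JSON object")
    return TOOLS[name](**arguments)

# Example model output, hard-coded so the sketch runs without a model.
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'
result = dispatch_tool_call(model_output)
print(result)
```

Rejecting unknown tool names and malformed JSON before dispatch matters more with a local model than with a hosted API, because nothing upstream validates the output for you.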
The MoE Architecture: How 26B Becomes 3.8B
Mixture of Experts (MoE) is worth understanding because it fundamentally changes the inference math.
In a standard dense model, every parameter is involved in every forward pass. A 26B dense model processes a token through all 26B parameters — memory bandwidth and compute scale linearly with parameter count.
In an MoE model, the network has many "expert" sub-networks, but each token is routed to only a small subset of them. Gemma 4 26B MoE activates roughly 3.8B parameters per forward pass despite having 26B in total.
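The routing idea can be sketched in a few lines of Python. This is a toy top-k router, not Gemma 4's actual architecture: the eight experts, the `k=2` choice, and the gate scores are all made up for illustration, and real experts are feed-forward blocks rather than scalar functions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of only the k selected experts."""
    routed = route_top_k(gate_logits, k)
    output = sum(w * experts[i](x) for i, w in routed)
    return output, [i for i, _ in routed]

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)]
# Made-up router scores for a single token.
gate_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]

output, used = moe_forward(1.0, experts, gate_logits, k=2)
print(f"experts used: {used}, output: {output}")
```

Only two of the eight toy experts run for this token. The other six contribute nothing to the computation, which is the same reason the 26B MoE touches roughly 3.8B of its parameters per forward pass.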
Practical implications:
- Memory: You still need to load all 26B parameters into RAM/VRAM (the full model weights must be present)
- Compute: Each token only passes through ~3.8B params worth of computation, so throughput is much higher
- Quality: All 26B parameters are trained, so the experts collectively hold 26B params' worth of specialization; routing selects the relevant experts for each token
For local inference, MoE is excellent when VRAM isn't the bottleneck. If you have the memory to hold the model, tokens-per-second will feel closer to a 4B dense model than a 26B dense model.
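A back-of-envelope Python calculation shows where that intuition comes from. It assumes single-stream decoding is limited purely by how many weight bytes must be read per token, and it ignores attention, KV-cache reads, and routing overhead. The 1000 GB/s bandwidth is a hypothetical round number, not a measurement of any particular GPU.

```python
TOTAL_PARAMS_B = 26.0    # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 3.8    # parameters touched per token, in billions (from the article)
BYTES_PER_PARAM = 2      # bfloat16 weights
BANDWIDTH_GB_S = 1000.0  # hypothetical memory bandwidth, not a measured figure

def decode_ceiling_tokens_per_s(params_read_b, bytes_per_param, bandwidth_gb_s):
    """Upper bound on decode speed if reading weights were the only cost."""
    gb_read_per_token = params_read_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
moe_ceiling = decode_ceiling_tokens_per_s(ACTIVE_PARAMS_B, BYTES_PER_PARAM, BANDWIDTH_GB_S)
dense_ceiling = decode_ceiling_tokens_per_s(TOTAL_PARAMS_B, BYTES_PER_PARAM, BANDWIDTH_GB_S)

print(f"active fraction: {active_fraction:.1%}")
print(f"MoE ceiling:     {moe_ceiling:.0f} tokens/s")
print(f"dense ceiling:   {dense_ceiling:.0f} tokens/s")
print(f"speedup bound:   {moe_ceiling / dense_ceiling:.1f}x")
```

Under these assumptions the speedup bound is simply the ratio of total to active parameters, whatever bandwidth you plug in. Real-world gains will be smaller, since the ignored costs do not shrink with the active parameter count.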
The Edge Models: E2B and E4B
The two edge variants — Effective 2B and Effective 4B — are designed for on-device deployment. "Effective" refers to the active parameter count during inference, similar to the MoE logic.
Target hardware:
- Raspberry Pi 4/5
- NVIDIA Jetson Nano
- Qualcomm Snapdragon (optimized in collaboration with Qualcomm)
- MediaTek Dimensity (MediaTek-optimized)
- Android smartphones
Key capabilities of E2B/E4B:
- 128K context window — substantial for edge devices
- Near-zero latency (Google's claim, tied to the on-device optimization work with Qualcomm and MediaTek)
- Speech recognition — native support, improved from Gemma 3n
- Lower memory/battery than Gemma 3 equivalents
The E-series models are also the foundation for Gemini Nano 4: Google confirmed to Ars Technica that the next-gen Pixel on-device AI (Gemini Nano 4) will be based on Gemma 4 E2B and E4B. Apps built on E2B/E4B today should therefore be forward-compatible with Gemini Nano 4 when it ships.
How to Run Gemma 4 Today
Ollama (easiest for local dev)
```shell
# Pull and run Gemma 4 26B MoE
ollama pull gemma4:26b-moe
# Or the 31B Dense
ollama pull gemma4:31b
# Run
ollama run gemma4:26b-moe
```
Models are available at ollama.com/library/gemma4.
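Once a model is pulled, you can also call it from code through Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes an Ollama server is running on its default port (11434). The model tag is the one used in this article, and `build_generate_request` and `generate` are illustrative helper names, not part of any Ollama SDK.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]

# Requires a running Ollama server and a previously pulled model:
# print(generate("gemma4:26b-moe", "Explain mixture-of-experts in one sentence."))
```

Setting `"stream": False` returns the whole reply as one JSON object, which keeps the example short. For interactive use you would normally leave streaming on and read the response line by line.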
Hugging Face
Full model weights with all quantization variants:
google/gemma-4-31b-it and google/gemma-4-26b-moe-it (instruction-tuned)
google/gemma-4-e4b-it and google/gemma-4-e2b-it
Available at huggingface.co/collections/google/gemma-4.
Google AI Studio
Cloud-hosted inference for 31B Dense and 26B MoE — free tier available. No local hardware required. Good for evaluation before committing to local deployment.
Kaggle
Full model weights via Kaggle Models — useful if you're already in the Kaggle ecosystem.
Quantization and Hardware Requirements
Unquantized (bfloat16) requirements:
- 31B Dense: ~62GB VRAM → requires 80GB H100 or multi-GPU setup
- 26B MoE: ~52GB VRAM → same requirement, but inference compute is much lower
- E4B: ~8GB VRAM → consumer GPU territory (RTX 3060 and above)
- E2B: ~4GB VRAM → most modern consumer GPUs
With INT4 quantization (GGUF via llama.cpp / Ollama):
- 31B Dense: ~16-20GB → fits in a 24GB consumer GPU (RTX 3090, 4090, etc.)
- 26B MoE: ~13-16GB → same consumer GPU range
- E4B: ~2-3GB → runs on CPU with reasonable performance
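The figures above follow from simple arithmetic: parameter count times bytes per parameter. Here is a small Python sketch that reproduces the raw weight sizes, using the article's nominal parameter counts. It covers weights only; the `weight_memory_gb` helper is illustrative and deliberately leaves out KV cache, activations, and quantization metadata.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes), excluding runtime overhead."""
    return params_billions * bits_per_param / 8

# Nominal sizes in billions; E4B/E2B use the article's "~4B" and "~2B" figures.
MODELS = {"31B Dense": 31, "26B MoE": 26, "E4B": 4, "E2B": 2}

for name, size in MODELS.items():
    bf16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{name:10s} bf16 ~{bf16:.1f} GB   int4 ~{int4:.1f} GB")
```

The raw INT4 numbers land at the low end of the ranges listed above. The remaining gap is most plausibly runtime overhead such as quantization scales, the KV cache, and framework buffers, all of which grow with context length.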
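For the Forge/Berserki setup (RTX 5090 32GB / RTX 5060 Ti 16GB), the 26B MoE is the practical sweet spot for strong reasoning at fast throughput: INT8 (~26GB of weights) fits on the 32GB card, while the 16GB card needs INT4, where the ~13-16GB footprint listed above leaves little headroom.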
Gemma 4 vs. The Field
| Model | Params | License | Context | Local? | Notable |
|---|---|---|---|---|---|
| Gemma 4 31B Dense | 31B | Apache 2.0 | 256K | ✅ | #3 Arena, Gemini 3 architecture |
| Gemma 4 26B MoE | 26B (3.8B active) | Apache 2.0 | 256K | ✅ | Fast inference, MoE efficiency |
| Qwen 2.5-Coder 32B | 32B | Apache 2.0 | 128K | ✅ | Best local coding model |
| Qwen 3.6-Plus | unknown | Alibaba custom | 1M | ❌ (API) | Frontier-level, agentic |
| Llama 3.3 70B | 70B | Meta custom | 128K | ✅ (large GPU) | Strong all-around |
| Mistral Large 2 | 123B | Mistral custom | 128K | ✅ (multi-GPU) | Enterprise focus |
Gemma 4's combination of Apache 2.0 licensing + 256K context + #3 Arena position makes it the clearest recommendation for teams that need a capable open-weight model for agentic workflows, coding assistance, or private inference.
Who Should Use Gemma 4
Local AI developers: If you're running a local LLM stack for private inference (no cloud, no API costs, no data leaving your machine), Gemma 4 26B MoE is now the strongest option in its hardware class.
Enterprise teams: The Apache 2.0 license removes the legal uncertainty that blocked Gemma 3 adoption. Build on it, fine-tune it, ship products based on it — the license is clean.
Agent builders: Native function calling and structured JSON output mean Gemma 4 works as a local reasoning backbone for tool-using agents. No API dependency, full offline capability.
Edge/mobile developers: Qualcomm/MediaTek optimization and forward-compatibility with Gemini Nano 4 make E2B/E4B the best-supported edge models Google has shipped.
Teams running code generation: Google claims competitive performance with Gemini Pro and Claude Code on code quality. For offline/private code gen (legal code, proprietary systems, airgapped environments), Gemma 4 31B is now a serious option.
The Bottom Line
Gemma 4 is the version of Gemma that Google should have shipped a year ago. The Apache 2.0 license alone resolves the single biggest adoption blocker for enterprise developers. The technical improvements — 256K context, MoE efficiency, improved reasoning, agent-native features, edge model improvements — are real additions on top of a now-clean licensing foundation.
For local AI deployment in 2026, Gemma 4 26B MoE is the new default recommendation. Fast, capable, truly open.
Start here: Gemma 4 on Ollama | Gemma 4 on Hugging Face | Gemma 4 in AI Studio
FAQ
What is Gemma 4?
Gemma 4 is Google's latest family of open-weight large language models, released April 2, 2026. It comes in four variants: 31B Dense, 26B MoE, E4B, and E2B. All are based on the same architecture as Google's closed Gemini 3 models and are now licensed under Apache 2.0.
Why is the Apache 2.0 license important for Gemma 4?
Previous Gemma versions used a custom Google license with restrictive terms that Google could change unilaterally. Apache 2.0 is one of the most permissive and legally clear open-source licenses: no commercial restrictions, no overbearing use policies, and Google can't retroactively change the terms for a release you've already built on.
What does "3.8B active parameters" mean for the 26B MoE model?
The 26B MoE uses a Mixture of Experts architecture. Although 26 billion parameters exist in the model weights, each token only passes through ~3.8 billion of them during inference. This means inference speed is comparable to a 4B dense model while maintaining the quality of a 26B trained model.
What hardware do I need to run Gemma 4?
Quantized (INT4): 31B Dense needs ~16-20GB VRAM (RTX 3090/4090), E4B needs ~3GB VRAM (most consumer GPUs). Unquantized: 31B and 26B require an 80GB H100. The E2B/E4B edge models run on Raspberry Pi, Jetson Nano, and modern smartphones.
Where can I download Gemma 4?
From Hugging Face, Kaggle, and Ollama. Cloud inference is available in Google AI Studio.