Gemma 4 Is Out: Apache 2.0, 3.8B Active Params, and the Best Local Model in 2026

Google dropped Gemma 4 on April 2 with four variants, a 256K context window, and — finally — an Apache 2.0 license. The 26B MoE activates only 3.8B params at inference. Here's what changed, what it means for local AI, and how it stacks up.

April 3, 2026·12 min read·1,913 words

In short: Gemma 4, released April 2, 2026, switches to Apache 2.0 licensing, the change that removes the enterprise adoption blocker from Gemma 3. It ships in four variants up to 256K context; the 26B MoE activates just 3.8B parameters per token, fitting a 24GB GPU at INT4 with fast throughput.

Google's Gemma 3 launched over a year ago and stayed competitive longer than most expected. Now Gemma 4 is here, and it's a meaningful step up — not just in raw performance, but in one area that mattered more than any benchmark: the license.

Starting with Gemma 4, Google has dropped its custom Gemma Terms of Use and switched to Apache 2.0. That single change may matter more for enterprise adoption than every performance gain in the release. This shift aligns well with the broader trends in AI licensing that prioritize flexibility and community-driven development, as discussed in AI Agent Guardrails & Output Validation in 2026: Tools, Patterns & Best Practices.

Here's the full breakdown: what's in Gemma 4, why Apache 2.0 is such a big deal, and how to run it today.

What Is Gemma 4?

Gemma 4 is Google's latest family of open-weight large language models, released April 2, 2026. Built on the same underlying architecture as the closed Gemini 3 models, it comes in four variants optimized for different hardware tiers — from data center GPUs to smartphones. This multi-faceted approach mirrors the strategies outlined in Multi-Agent Orchestration: A Practical Guide for 2026, emphasizing adaptability and performance across various platforms.

The four models:

Model	Params	Active at Inference	Context	Target Hardware
Gemma 4 31B Dense	31B	31B	256K	80GB H100 (unquantized) or consumer GPU (quantized)
Gemma 4 26B MoE	26B	3.8B	256K	80GB H100 (unquantized) or consumer GPU (quantized)
Gemma 4 E4B	~4B effective	~4B	128K	Smartphones, Raspberry Pi, Jetson Nano
Gemma 4 E2B	~2B effective	~2B	128K	Smartphones, low-power edge devices

The standout number is the 26B MoE: 3.8 billion active parameters during inference. Despite 26 billion total parameters in the model weights, inference touches only a small fraction of them on any given forward pass, which means tokens-per-second is far higher than for a comparably sized dense model. This efficiency is crucial for deploying models like Gemma 4 in resource-constrained environments, as highlighted in MCP Is Not Dead: Why Server-Side MCP Changes Everything for AI Agents.

The Apache 2.0 Shift: Why It Actually Matters

Gemma 3 shipped with a custom Google license that caused real problems for enterprise developers:

Prohibited-use policy that Google could update unilaterally: your legal compliance could change without any action on your part

Required developers to enforce Google's rules across all downstream projects — if you built on Gemma and someone used your product for a prohibited use, you were potentially liable
Synthetic data transfer clause — a reading of the license suggested that models trained on data generated by Gemma might be subject to Gemma's license terms, which killed use of Gemma for data augmentation pipelines
No commercial clarity — legal teams at enterprises routinely flagged the custom license as too uncertain to build on

Apache 2.0 eliminates all of this. Apache 2.0 is one of the most widely understood open-source licenses in existence. It's permissive, commercially friendly, and — crucially — Google can't unilaterally change it. If you build a product on Gemma 4 under Apache 2.0, the terms are locked in.

This is the license change that was blocking significant enterprise adoption of Gemma. It's now resolved.

Performance: Where Gemma 4 Actually Lands

Google claims Gemma 4 31B Dense launches at #3 on the LMSYS Chatbot Arena open-model leaderboard, behind only GLM-5 and Kimi 2.5. Those two models are dramatically larger, so Elo score relative to parameter count is where Gemma 4 is genuinely exceptional.

Key capabilities:

Reasoning and math: Improved over Gemma 3, built on Gemini 3's architecture
Code generation: Google claims quality competitive with Gemini Pro and Claude Code, with the advantage of running entirely locally (offline, private, no API costs)
Vision/OCR: Better at processing visual input — chart understanding, document OCR — compared to Gemma 3
Structured output: Native JSON output support, important for agentic workflows
Function calling: Native support for tool/function call formats — agents can use Gemma 4 as a local reasoning backbone

All four models support 140+ languages and include native structured JSON output, function calling, and common tool/API instruction formats.
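
Structured output only helps an agent if the caller checks it before acting on it. The following is a minimal sketch of that check, not Gemma 4's documented tool-call format: the `{"name": ..., "arguments": ...}` shape, the `get_weather` tool, and the `TOOLS` registry are all hypothetical and chosen for illustration.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "get_weather": {"city": str, "unit": str},
}


def parse_tool_call(raw: str) -> dict:
    """Parse model output as a tool call and validate it against TOOLS.

    Raises ValueError if the output is not valid JSON, names an unknown
    tool, or has missing or mistyped arguments.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc

    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")

    args = call.get("arguments")
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be a JSON object")

    for arg_name, arg_type in TOOLS[name].items():
        if arg_name not in args:
            raise ValueError(f"missing argument: {arg_name}")
        if not isinstance(args[arg_name], arg_type):
            raise ValueError(f"argument {arg_name!r} must be {arg_type.__name__}")

    return {"name": name, "arguments": args}


if __name__ == "__main__":
    sample = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}'
    print(parse_tool_call(sample))
```

Rejecting a malformed call with an exception, rather than guessing at the intent, lets the agent loop feed the error back to the model and ask for a corrected call.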

The MoE Architecture: How 26B Becomes 3.8B

Mixture of Experts (MoE) is worth understanding because it fundamentally changes the inference math.

In a standard dense model, every parameter is involved in every forward pass. A 26B dense model processes a token through all 26B parameters — memory bandwidth and compute scale linearly with parameter count.

In an MoE model, the network has many "expert" sub-networks, but each token is routed to only a small subset of them. Gemma 4 26B MoE activates roughly 3.8B parameters per forward pass despite having 26B in total.
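
The routing step can be sketched in a few lines. This is a generic top-k gating toy, not Gemma 4's actual router: the expert count, the value of `top_k`, and the scalar "experts" are illustrative assumptions, since the article does not give the real configuration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_layer(token, experts, router_logits, top_k):
    """Weighted sum of the outputs of only the selected experts."""
    output = 0.0
    for index, weight in route_token(router_logits, top_k):
        output += weight * experts[index](token)
    return output


if __name__ == "__main__":
    random.seed(0)
    num_experts, top_k = 8, 2  # illustrative values, not Gemma 4's real configuration
    # Each toy "expert" is a scalar function; real experts are feed-forward blocks.
    experts = [lambda x, scale=s: scale * x for s in range(1, num_experts + 1)]
    logits = [random.gauss(0.0, 1.0) for _ in range(num_experts)]
    print(route_token(logits, top_k))
    print(moe_layer(1.0, experts, logits, top_k))
```

Only the chosen experts are ever called, which is why compute tracks the active parameter count while every expert still has to sit in memory.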

Practical implications:

Memory: You still need to load all 26B parameters into RAM/VRAM (the full model weights must be present)
Compute: Each token only passes through ~3.8B params worth of computation, so throughput is much higher
Quality: All 26B parameters are trained, so the model carries 26B parameters' worth of specialized capacity; routing selects the relevant experts for each token

For local inference, MoE is excellent when VRAM isn't the bottleneck. If you have the memory to hold the model, tokens-per-second will feel closer to a 4B dense model than a 26B dense model.
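
A back-of-envelope estimate shows where the "closer to a 4B dense model" intuition comes from. The sketch assumes single-stream decoding is limited by memory bandwidth and that each active weight is read once per token; the 1000 GB/s bandwidth is a hypothetical figure, and the results are upper bounds rather than benchmarks.

```python
def decode_tokens_per_second(active_params_billion, bytes_per_param, bandwidth_gb_s):
    """Rough upper bound on decode speed for a memory-bandwidth-bound model.

    Assumes every active weight is read once per generated token and ignores
    KV-cache reads, routing overhead, and kernel efficiency.
    """
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    bandwidth = 1000.0  # GB/s, a hypothetical GPU; substitute your card's figure
    int4 = 0.5          # bytes per parameter at 4-bit quantization

    dense_26b = decode_tokens_per_second(26.0, int4, bandwidth)
    moe_26b = decode_tokens_per_second(3.8, int4, bandwidth)

    print(f"26B dense bound: {dense_26b:.0f} tok/s")
    print(f"26B MoE bound:   {moe_26b:.0f} tok/s")
    print(f"speedup bound:   {moe_26b / dense_26b:.1f}x")
```

The absolute numbers depend entirely on the assumed bandwidth; the useful part is the ratio, which under these assumptions is simply total parameters divided by active parameters (26 / 3.8).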

The Edge Models: E2B and E4B

The two edge variants — Effective 2B and Effective 4B — are designed for on-device deployment. "Effective" refers to the active parameter count during inference, similar to the MoE logic.

Target hardware:

Raspberry Pi 4/5
NVIDIA Jetson Nano
Qualcomm Snapdragon (optimized in collaboration with Qualcomm)
MediaTek Dimensity (MediaTek-optimized)
Android smartphones

Key capabilities of E2B/E4B:

128K context window — substantial for edge devices
Near-zero latency — Google's claim, validated by the on-device optimization work with Qualcomm and MediaTek
Speech recognition — native support, improved from Gemma 3n
Lower memory/battery than Gemma 3 equivalents

The E-series models are also the foundation for Gemini Nano 4 — Google confirmed to Ars Technica that the next-gen Pixel on-device AI (Gemini Nano 4) will be based on Gemma 4 E2B and E4B. Developers building with E2B/E4B today will be forward-compatible with Gemini Nano 4 when it ships.

How to Run Gemma 4 Today

Ollama (easiest for local dev)


# Pull and run Gemma 4 26B MoE
ollama pull gemma4:26b-moe

# Or the 31B Dense
ollama pull gemma4:31b

# Run
ollama run gemma4:26b-moe

Models are available at ollama.com/library/gemma4.
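
Once a model is pulled, Ollama also serves it over a local HTTP API, which is the usual way to call it from code. A minimal sketch using only the Python standard library follows; it assumes the server is running on Ollama's default port 11434 and reuses the `gemma4:26b-moe` tag from the commands above. The helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt, model="gemma4:26b-moe"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="gemma4:26b-moe", timeout=120):
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the server running, `generate("Summarize the Apache 2.0 license in one sentence.")` returns the model's reply as a string. Setting `"stream": False` trades incremental output for a single JSON response that is simpler to parse.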

Hugging Face

Full model weights with all quantization variants:

google/gemma-4-31b-it and google/gemma-4-26b-moe-it (instruction-tuned)

google/gemma-4-e4b-it and google/gemma-4-e2b-it

Available at huggingface.co/collections/google/gemma-4.

Google AI Studio

Cloud-hosted inference for 31B Dense and 26B MoE — free tier available. No local hardware required. Good for evaluation before committing to local deployment.

Kaggle

Full model weights via Kaggle Models — useful if you're already in the Kaggle ecosystem.

Quantization and Hardware Requirements

Unquantized (bfloat16) requirements:

31B Dense: ~62GB VRAM → requires 80GB H100 or multi-GPU setup
26B MoE: ~52GB VRAM → same requirement, but inference compute is much lower
E4B: ~8GB VRAM → consumer GPU territory (RTX 3060 and above)
E2B: ~4GB VRAM → most modern consumer GPUs

With INT4 quantization (GGUF via llama.cpp / Ollama):

31B Dense: ~16-20GB → fits in a 24GB consumer GPU (RTX 3090, 4090, etc.)
26B MoE: ~13-16GB → same consumer GPU range
E4B: ~2-3GB → runs on CPU with reasonable performance
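
The figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces the weights-only numbers and adds a rough fit check. The parameter counts are taken from this article, and the 4 GB headroom for KV cache and runtime buffers is a hypothetical allowance, not a measured value. Quantized GGUF files typically come out somewhat larger than the raw 4-bit figure because they store scaling data and keep some tensors at higher precision.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion, precision, vram_gb, headroom_gb=4.0):
    """True if weights plus a headroom budget fit in the given VRAM.

    headroom_gb is a hypothetical allowance for KV cache, activations, and
    runtime buffers; the real figure depends on context length and backend.
    """
    return weight_gb(params_billion, precision) + headroom_gb <= vram_gb


if __name__ == "__main__":
    # Parameter counts in billions, as listed in the article.
    models = {"31B Dense": 31.0, "26B MoE": 26.0, "E4B": 4.0, "E2B": 2.0}
    for name, size in models.items():
        row = ", ".join(f"{p}: {weight_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM)
        print(f"{name:10s} {row}")
    print("26B MoE int4 on 24 GB:", fits(26.0, "int4", 24.0))
    print("31B Dense bf16 on 24 GB:", fits(31.0, "bf16", 24.0))
```

Long contexts are the main reason to budget more headroom: the KV cache grows with context length, so a model that fits at short prompts may not fit near the full context window.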

For the Forge/Berserki setup (RTX 5090 32GB / RTX 5060 Ti 16GB): the 26B MoE at INT4/INT8 is the practical sweet spot — full-quality reasoning at fast throughput.

Gemma 4 vs. The Field

Model	Params	License	Context	Local?	Notable
Gemma 4 31B Dense	31B	Apache 2.0	256K	✅	#3 Arena, Gemini 3 architecture
Gemma 4 26B MoE	26B (3.8B active)	Apache 2.0	256K	✅	Fast inference, MoE efficiency
Qwen 2.5-Coder 32B	32B	Apache 2.0	128K	✅	Best local coding model
Qwen 3.6-Plus	unknown	Alibaba custom	1M	❌ (API)	Frontier-level, agentic
Llama 3.3 70B	70B	Meta custom	128K	✅ (large GPU)	Strong all-around
Mistral Large 2	123B	Mistral custom	128K	✅ (multi-GPU)	Enterprise focus

Gemma 4's combination of Apache 2.0 licensing + 256K context + #3 Arena position makes it the clearest recommendation for teams that need a capable open-weight model for agentic workflows, coding assistance, or private inference.

Who Should Use Gemma 4

Local AI developers: If you're running a local LLM stack for private inference (no cloud, no API costs, no data leaving your machine), Gemma 4 26B MoE is now the strongest option in its hardware class.

Enterprise teams: The Apache 2.0 license removes the legal uncertainty that blocked Gemma 3 adoption. Build on it, fine-tune it, ship products based on it — the license is clean.

Agent builders: Native function calling and structured JSON output means Gemma 4 works as a local reasoning backbone for tool-using agents. No API dependency, full offline capability.

Edge/mobile developers: E2B/E4B with Qualcomm/MediaTek optimization and forward-compatibility with Gemini Nano 4 makes these the best-supported edge models Google has shipped.

Teams running code generation: Google claims competitive performance with Gemini Pro and Claude Code on code quality. For offline/private code gen (legal code, proprietary systems, airgapped environments), Gemma 4 31B is now a serious option.

The Bottom Line

Gemma 4 is the version of Gemma that Google should have shipped a year ago. The Apache 2.0 license alone resolves the single biggest adoption blocker for enterprise developers. The technical improvements — 256K context, MoE efficiency, improved reasoning, agent-native features, edge model improvements — are real additions on top of a now-clean licensing foundation.

For local AI deployment in 2026, Gemma 4 26B MoE is the new default recommendation. Fast, capable, truly open.

Start here: Gemma 4 on Ollama | Gemma 4 on Hugging Face | Gemma 4 in AI Studio

FAQ

What is Gemma 4?

Gemma 4 is Google's latest family of open-weight large language models, released April 2, 2026. It comes in four variants: 31B Dense, 26B MoE, E4B, and E2B. All are based on the same architecture as Google's closed Gemini 3 models and are now licensed under Apache 2.0.

Why is the Apache 2.0 license important for Gemma 4?

Previous Gemma versions used a custom Google license with restrictive terms that Google could change unilaterally. Apache 2.0 is one of the most permissive and legally clear open-source licenses: no commercial restrictions, no overbearing use policies, and Google can't retroactively change the terms for a release you've already built on.

What does "3.8B active parameters" mean for the 26B MoE model?

The 26B MoE uses a Mixture of Experts architecture. Although 26 billion parameters exist in the model weights, each token only passes through ~3.8 billion of them during inference. This means per-token compute is comparable to a 4B dense model while the model still draws on the capacity of all 26B trained parameters. All 26B parameters must still be loaded into memory.

What hardware do I need to run Gemma 4?

Quantized (INT4): 31B Dense needs ~16-20GB VRAM (RTX 3090/4090), 26B MoE needs ~13-16GB VRAM (same consumer GPU range), and E4B needs ~2-3GB (most consumer GPUs, or CPU). Unquantized: 31B and 26B require an 80GB H100 or a multi-GPU setup. The E2B/E4B edge models run on Raspberry Pi, Jetson Nano, and modern smartphones.

Where can I download Gemma 4?

From Hugging Face, Kaggle, and Ollama. Cloud inference is available in Google AI Studio.
