Gemma 4 Is Out: Apache 2.0, 3.8B Active Params, and the Best Local Model in 2026

Google dropped Gemma 4 on April 2 with four variants, a 256K context window, and — finally — an Apache 2.0 license. The 26B MoE activates only 3.8B params at inference. Here's what changed, what it means for local AI, and how it stacks up.

April 3, 2026·12 min read·1,862 words

Google's Gemma 3 launched over a year ago and stayed competitive longer than most expected. Now Gemma 4 is here, and it's a meaningful step up — not just in raw performance, but in one area that mattered more than any benchmark: the license.

Starting with Gemma 4, Google has dropped its custom Gemma Terms of Use and switched to Apache 2.0. That single change may matter more for enterprise adoption than every performance gain in the release. This shift aligns well with the broader trends in AI licensing that prioritize flexibility and community-driven development, as discussed in AI Agent Guardrails & Output Validation in 2026: Tools, Patterns & Best Practices.

Here's the full breakdown: what's in Gemma 4, why Apache 2.0 is such a big deal, and how to run it today.

What Is Gemma 4?

Gemma 4 is Google's latest family of open-weight large language models, released April 2, 2026. Built on the same underlying architecture as the closed Gemini 3 models, it comes in four variants optimized for different hardware tiers — from data center GPUs to smartphones. This multi-faceted approach mirrors the strategies outlined in Multi-Agent Orchestration: A Practical Guide for 2026, emphasizing adaptability and performance across various platforms.

The four models:

| Model | Params | Active at Inference | Context | Target Hardware |
|---|---|---|---|---|
| Gemma 4 31B Dense | 31B | 31B | 256K | 80GB H100 (unquantized) or consumer GPU (quantized) |
| Gemma 4 26B MoE | 26B | 3.8B | 256K | 80GB H100 (unquantized) or consumer GPU (quantized) |
| Gemma 4 E4B | ~4B effective | ~4B | 128K | Smartphones, Raspberry Pi, Jetson Nano |
| Gemma 4 E2B | ~2B effective | ~2B | 128K | Smartphones, low-power edge devices |

The standout number is the 26B MoE: 3.8 billion active parameters during inference. Despite 26 billion total parameters in the model weights, inference touches only a small fraction of them on any given forward pass, so tokens-per-second is far higher than for a comparably sized dense model. This efficiency is crucial for deploying models like Gemma 4 in resource-constrained environments, as highlighted in MCP Is Not Dead: Why Server-Side MCP Changes Everything for AI Agents.

The Apache 2.0 Shift: Why It Actually Matters

Gemma 3 shipped with a custom Google license that caused real problems for enterprise developers:

  • Prohibited-use policy that Google could update unilaterally: your legal compliance could change without any action on your part
  • Required developers to enforce Google's rules across all downstream projects: if you built on Gemma and someone used your product for a prohibited use, you were potentially liable
  • Synthetic data transfer clause: a reading of the license suggested that models trained on data generated by Gemma might be subject to Gemma's license terms, which killed use of Gemma for data augmentation pipelines
  • No commercial clarity: legal teams at enterprises routinely flagged the custom license as too uncertain to build on

Apache 2.0 eliminates all of this. It is one of the most widely understood open-source licenses in existence: permissive, commercially friendly, and, crucially, not something Google can change unilaterally for a version already released under it. If you build a product on Gemma 4 under Apache 2.0, the terms are locked in.

This is the license change that was blocking significant enterprise adoption of Gemma. It's now resolved.

Performance: Where Gemma 4 Actually Lands

Google claims Gemma 4 31B Dense launches at #3 on the LMSYS Chatbot Arena open-model leaderboard, behind only GLM-5 and Kimi 2.5. Those two models are dramatically larger; Elo score relative to parameter count is where Gemma 4 is genuinely exceptional.

Key capabilities:

  • Reasoning and math: Improved over Gemma 3, built on Gemini 3's architecture
  • Code generation: Google claims competitive with Gemini Pro and Claude Code on quality, with the advantage of running entirely locally (offline, private, no API costs)
  • Vision/OCR: Better at processing visual input — chart understanding, document OCR — compared to Gemma 3
  • Structured output: Native JSON output support, important for agentic workflows
  • Function calling: Native support for tool/function call formats — agents can use Gemma 4 as a local reasoning backbone

All four models support 140+ languages and include native structured JSON output, function calling, and common tool/API instruction formats.
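To make the function-calling point concrete, here is a minimal sketch of the application side of a tool call: the model emits a JSON object, and the host program validates it and dispatches to a real function. The tool name `get_weather`, the `{"tool": ..., "arguments": ...}` shape, and the sample reply string are all hypothetical stand-ins, not Gemma 4's documented format.

```python
import json


# Hypothetical tool; the name and signature are illustrative only.
def get_weather(city: str) -> str:
    return f"Weather lookup requested for {city}"


TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(raw_reply: str) -> str:
    """Parse a JSON tool call emitted by a model and run the matching tool.

    Assumes the reply looks like {"tool": <name>, "arguments": {...}}.
    Real models and runtimes each define their own schema.
    """
    try:
        call = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model reply is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    return TOOLS[name](**args)


# A made-up reply standing in for real model output.
sample_reply = '{"tool": "get_weather", "arguments": {"city": "Zurich"}}'
print(dispatch_tool_call(sample_reply))
```

Validating before dispatching is the important design choice here: even with native structured output, the host should treat the model's reply as untrusted input.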

The MoE Architecture: How 26B Becomes 3.8B

Mixture of Experts (MoE) is worth understanding because it fundamentally changes the inference math.

In a standard dense model, every parameter is involved in every forward pass. A 26B dense model processes a token through all 26B parameters — memory bandwidth and compute scale linearly with parameter count.

In an MoE model, the network has many "expert" sub-networks, but each token is routed to only a small subset of them. Gemma 4 26B MoE activates roughly 3.8B parameters per forward pass despite having 26B in total.
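The routing idea can be sketched in a few lines. This is a toy top-k router, assuming softmax gating with renormalized weights, which is one common MoE design. The expert count, `top_k` value, and the "experts" themselves (simple scaling functions) are illustrative and say nothing about Gemma 4's actual configuration.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized so they sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, experts, router_logits, top_k):
    """Weighted sum of the outputs of only the selected experts."""
    routes = route_token(router_logits, top_k)
    return sum(weight * experts[i](token) for i, weight in routes)


# Toy setup: 8 "experts" that each scale the input by a different factor.
NUM_EXPERTS, TOP_K = 8, 2  # illustrative values only
experts = [lambda x, s=s: x * s for s in range(1, NUM_EXPERTS + 1)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = route_token(logits, TOP_K)
print("experts used:", [i for i, _ in routes], "of", NUM_EXPERTS)
print("layer output:", moe_layer(1.0, experts, logits, TOP_K))
```

Only the selected experts are ever called for a given token, which is where the compute savings come from; the unselected experts still occupy memory.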

Practical implications:

  • Memory: You still need to load all 26B parameters into RAM/VRAM (the full model weights must be present)
  • Compute: Each token only passes through ~3.8B params worth of computation, so throughput is much higher
  • Quality: All 26B parameters are trained, with different experts specializing in different inputs; routing selects the relevant expertise for each token

For local inference, MoE is excellent when VRAM isn't the bottleneck. If you have the memory to hold the model, tokens-per-second will feel closer to a 4B dense model than a 26B dense model.
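A back-of-envelope way to see why: the sketch below assumes single-stream decoding is memory-bandwidth bound and that each generated token reads every active weight once. That is a common rule of thumb, not a measurement, and it ignores KV cache traffic, kernel overhead, and batching. The 1000 GB/s bandwidth figure is a hypothetical round number.

```python
def decode_tokens_per_second(active_params_b, bytes_per_param, bandwidth_gb_s):
    """Rough upper bound on single-stream decode speed.

    active_params_b: active parameters in billions
    bytes_per_param: 2.0 for bf16, 1.0 for INT8, 0.5 for INT4
    bandwidth_gb_s:  memory bandwidth in GB/s
    """
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 1000.0  # hypothetical GPU memory bandwidth
BYTES_INT4 = 0.5

dense_26b = decode_tokens_per_second(26.0, BYTES_INT4, BANDWIDTH_GB_S)
moe_active = decode_tokens_per_second(3.8, BYTES_INT4, BANDWIDTH_GB_S)
dense_4b = decode_tokens_per_second(4.0, BYTES_INT4, BANDWIDTH_GB_S)

print(f"26B dense            : {dense_26b:7.1f} tok/s (upper bound)")
print(f"26B MoE, 3.8B active : {moe_active:7.1f} tok/s (upper bound)")
print(f"4B dense             : {dense_4b:7.1f} tok/s (upper bound)")
print(f"MoE vs 26B dense     : {moe_active / dense_26b:.1f}x")
```

The absolute numbers depend entirely on the assumed bandwidth; the ratio between the dense and MoE cases is the part that carries over.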

The Edge Models: E2B and E4B

The two edge variants — Effective 2B and Effective 4B — are designed for on-device deployment. "Effective" refers to the active parameter count during inference, similar to the MoE logic.

Target hardware:

  • Raspberry Pi 4/5
  • NVIDIA Jetson Nano
  • Qualcomm Snapdragon (optimized in collaboration with Qualcomm)
  • MediaTek Dimensity (MediaTek-optimized)
  • Android smartphones

Key capabilities of E2B/E4B:

  • 128K context window — substantial for edge devices
  • Near-zero latency — Google's claim, validated by the on-device optimization work with Qualcomm and MediaTek
  • Speech recognition — native support, improved from Gemma 3n
  • Lower memory/battery than Gemma 3 equivalents

The E-series models are also the foundation for Gemini Nano 4 — Google confirmed to Ars Technica that the next-gen Pixel on-device AI (Gemini Nano 4) will be based on Gemma 4 E2B and E4B. Developers building with E2B/E4B today will be forward-compatible with Gemini Nano 4 when it ships.

How to Run Gemma 4 Today

Ollama (easiest for local dev)


```bash
# Pull and run Gemma 4 26B MoE
ollama pull gemma4:26b-moe

# Or the 31B Dense
ollama pull gemma4:31b

# Run
ollama run gemma4:26b-moe
```

Models are available at ollama.com/library/gemma4.
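Once a model is pulled, Ollama also serves a local HTTP API, which is how most applications will talk to it. The sketch below builds a request for the `/api/generate` endpoint on Ollama's default port using only the standard library. The prompt is a placeholder, and the model tag is the one used in this article. The snippet only prints the request it would send; calling `generate(...)` needs a running Ollama server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]


# Inspect the request without touching the network.
req = build_generate_request("gemma4:26b-moe", "Explain MoE routing in one sentence.")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With the server running, `generate("gemma4:26b-moe", "...")` would return the model's reply as a string.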

Hugging Face

Full model weights with all quantization variants:

google/gemma-4-31b-it and google/gemma-4-26b-moe-it (instruction-tuned)

google/gemma-4-e4b-it and google/gemma-4-e2b-it

Available at huggingface.co/collections/google/gemma-4.

Google AI Studio

Cloud-hosted inference for 31B Dense and 26B MoE — free tier available. No local hardware required. Good for evaluation before committing to local deployment.

Kaggle

Full model weights via Kaggle Models — useful if you're already in the Kaggle ecosystem.

Quantization and Hardware Requirements

Unquantized (bfloat16) requirements:

  • 31B Dense: ~62GB VRAM → requires 80GB H100 or multi-GPU setup
  • 26B MoE: ~52GB VRAM → same requirement, but inference compute is much lower
  • E4B: ~8GB VRAM → consumer GPU territory (RTX 3060 and above)
  • E2B: ~4GB VRAM → most modern consumer GPUs

With INT4 quantization (GGUF via llama.cpp / Ollama):

  • 31B Dense: ~16-20GB → fits in a 24GB consumer GPU (RTX 3090, 4090, etc.)
  • 26B MoE: ~13-16GB → same consumer GPU range
  • E4B: ~2-3GB → runs on CPU with reasonable performance
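The figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces the weight-only estimates and adds a naive fit check. It excludes KV cache, activations, and runtime overhead, and the 2GB `headroom_gb` default is an arbitrary assumption, so treat the results as lower bounds rather than guarantees.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


def fits(params_billions: float, dtype: str, vram_gb: float,
         headroom_gb: float = 2.0) -> bool:
    """Naive check: weights plus an assumed fixed headroom must fit in VRAM."""
    return weight_memory_gb(params_billions, dtype) + headroom_gb <= vram_gb


# Parameter counts as given in this article.
MODELS = {"31B Dense": 31.0, "26B MoE": 26.0, "E4B": 4.0, "E2B": 2.0}

for name, size in MODELS.items():
    row = ", ".join(
        f"{dtype}: {weight_memory_gb(size, dtype):5.1f} GB" for dtype in BYTES_PER_PARAM
    )
    print(f"{name:10s} {row}")

print("26B MoE INT4 on a 16GB card:", fits(26.0, "int4", 16.0))
print("31B Dense INT4 on a 24GB card:", fits(31.0, "int4", 24.0))
```

Real GGUF files come out somewhat larger than the pure INT4 figure because quantization formats store scales and keep some tensors at higher precision, which is why the ranges above run a few GB over the raw estimate.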

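For the Forge/Berserki setup (RTX 5090 32GB / RTX 5060 Ti 16GB): the 26B MoE is the practical sweet spot. At INT4 (~13-16GB) it fits the 16GB card with little headroom, and at INT8 (roughly 26GB, at one byte per parameter) it fits the 32GB card, giving strong reasoning at fast throughput.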

Gemma 4 vs. The Field

| Model | Params | License | Context | Local? | Notable |
|---|---|---|---|---|---|
| Gemma 4 31B Dense | 31B | Apache 2.0 | 256K | ✅ | #3 Arena, Gemini 3 architecture |
| Gemma 4 26B MoE | 26B (3.8B active) | Apache 2.0 | 256K | ✅ | Fast inference, MoE efficiency |
| Qwen 2.5-Coder 32B | 32B | Apache 2.0 | 128K | ✅ | Best local coding model |
| Qwen 3.6-Plus | unknown | Alibaba custom | 1M | ❌ (API) | Frontier-level, agentic |
| Llama 3.3 70B | 70B | Meta custom | 128K | ✅ (large GPU) | Strong all-around |
| Mistral Large 2 | 123B | Mistral custom | 128K | ✅ (multi-GPU) | Enterprise focus |

Gemma 4's combination of Apache 2.0 licensing + 256K context + #3 Arena position makes it the clearest recommendation for teams that need a capable open-weight model for agentic workflows, coding assistance, or private inference.

Who Should Use Gemma 4

Local AI developers: If you're running a local LLM stack for private inference (no cloud, no API costs, no data leaving your machine), Gemma 4 26B MoE is now the strongest option in its hardware class.

Enterprise teams: The Apache 2.0 license removes the legal uncertainty that blocked Gemma 3 adoption. Build on it, fine-tune it, ship products based on it — the license is clean.

Agent builders: Native function calling and structured JSON output means Gemma 4 works as a local reasoning backbone for tool-using agents. No API dependency, full offline capability.

Edge/mobile developers: E2B/E4B with Qualcomm/MediaTek optimization and forward-compatibility with Gemini Nano 4 makes these the best-supported edge models Google has shipped.

Teams running code generation: Google claims competitive performance with Gemini Pro and Claude Code on code quality. For offline/private code gen (legal code, proprietary systems, airgapped environments), Gemma 4 31B is now a serious option.

The Bottom Line

Gemma 4 is the version of Gemma that Google should have shipped a year ago. The Apache 2.0 license alone resolves the single biggest adoption blocker for enterprise developers. The technical improvements — 256K context, MoE efficiency, improved reasoning, agent-native features, edge model improvements — are real additions on top of a now-clean licensing foundation.

For local AI deployment in 2026, Gemma 4 26B MoE is the new default recommendation. Fast, capable, truly open.

Start here: Gemma 4 on Ollama | Gemma 4 on Hugging Face | Gemma 4 in AI Studio

FAQ

What is Gemma 4?

Gemma 4 is Google's latest family of open-weight large language models, released April 2, 2026. It comes in four variants: 31B Dense, 26B MoE, E4B, and E2B. All are based on the same architecture as Google's closed Gemini 3 models and are now licensed under Apache 2.0.

Why is the Apache 2.0 license important for Gemma 4?

Previous Gemma versions used a custom Google license with restrictive terms that Google could change unilaterally. Apache 2.0 is one of the most permissive and legally clear open-source licenses: no commercial restrictions, no overbearing use policies, and Google can't retroactively change the terms for a release you've already built on.

What does "3.8B active parameters" mean for the 26B MoE model?

The 26B MoE uses a Mixture of Experts architecture. Although 26 billion parameters exist in the model weights, each token only passes through ~3.8 billion of them during inference. This means per-token compute is comparable to a 4B dense model, while output quality benefits from the full 26B of trained parameters. All 26B still have to fit in memory.

What hardware do I need to run Gemma 4?

Quantized (INT4): 31B Dense needs ~16-20GB VRAM (RTX 3090/4090), E4B needs ~3GB VRAM (most consumer GPUs). Unquantized: 31B and 26B require an 80GB H100. The E2B/E4B edge models run on Raspberry Pi, Jetson Nano, and modern smartphones.

Where can I download Gemma 4?

From Hugging Face, Kaggle, and Ollama. Cloud inference is available in Google AI Studio.

