NVIDIA Nemotron 3: Complete Guide to Super, Nano, and GenRM (2026)

NVIDIA's Nemotron 3 family explained: Super (120B), Nano (30B), and GenRM reward model. Specs, benchmarks, architecture, and how they compare to Qwen, GPT-OSS, and Llama.

March 13, 2026 · 9 min read · 1,884 words

NVIDIA isn't just making the GPUs that run AI models — they're building the models themselves. The Nemotron 3 family is their most ambitious open-weight release yet: three models designed for different scales, from edge devices to multi-agent production systems.

Here's the full breakdown: what each model does, how they compare to competitors, and who should actually use them.

The Nemotron 3 Family at a Glance

| Model | Total Params | Active Params | Architecture | Context | Release | License |
|---|---|---|---|---|---|---|
| Nemotron 3 Super | 120B | 12B | Hybrid Mamba-Transformer MoE | 1M tokens | March 2026 | Open weights |
| Nemotron 3 Nano | 31.6B | 3.2B | Hybrid Mamba-Transformer MoE | 1M tokens | December 2025 | Open weights |
| Qwen3-Nemotron-235B-A22B-GenRM | 235B | 22B | Transformer (Qwen3) | 128K tokens | March 2026 | Apache 2.0 |
| Nemotron 3 Ultra | ~500B | ~50B | TBA | TBA | Coming soon | TBA |

All three released models are open weight. All use Mixture-of-Experts (MoE), meaning only a fraction of parameters activates per token — keeping inference fast despite large total parameter counts.

Nemotron 3 Super (120B-A12B): The Agentic Workhorse

Super is the headline model. It's built for one purpose: running as the brain of autonomous AI agents at scale. And the stakes are real — companies are already cutting knowledge workers and citing AI agents as the replacement. Models like Super are what make that possible.

Why It Matters

Multi-agent systems have a scaling problem. Agents re-send conversation history, tool outputs, and reasoning steps every turn — generating up to 15x more tokens than a standard chat interaction. This "context explosion" causes goal drift over long tasks, and using massive reasoning models for every sub-task makes things expensive and slow.
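
A toy calculation makes the blow-up concrete (the numbers here are illustrative, not NVIDIA's measurements): when an agent re-sends its full history every turn, input tokens grow roughly quadratically with turn count.

```python
# Toy model of "context explosion": each agent turn re-sends the full
# conversation history, so total input tokens grow roughly quadratically.

def total_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Sum of history re-sent at each turn: turn t re-sends t * tokens_per_turn."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

chat = total_input_tokens(turns=1, tokens_per_turn=500)    # single-shot chat
agent = total_input_tokens(turns=10, tokens_per_turn=500)  # 10-turn agent loop

print(agent // chat)  # → 55: a 10-turn loop processes ~55x the input tokens
```

Even a modest 10-turn loop multiplies input-token volume by an order of magnitude, which is why per-token inference cost dominates agent economics.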

Super solves this with efficiency-first architecture design.

Architecture Deep Dive

Hybrid Mamba-Transformer MoE. Super interleaves three layer types:

  • Mamba-2 layers handle most sequence processing with linear-time complexity. This is what makes the 1M-token context window practical, not theoretical.
  • Transformer attention layers are interleaved at key depths for precise associative recall — finding specific facts buried in long contexts.
  • MoE layers scale effective capacity without dense computation costs. Only a subset of experts activates per token.

Latent MoE. Before routing decisions, token embeddings are compressed into a low-rank latent space. Expert computation happens in this smaller dimension. The result: 4x as many expert specialists can be consulted for the same inference cost. This enables fine-grained specialization — distinct experts for Python syntax vs. SQL logic vs. conversational reasoning.
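
A minimal sketch of the latent-routing idea — the dimensions, random weights, and top-k rule below are illustrative stand-ins, not Nemotron's actual implementation:

```python
import random

random.seed(0)

D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 8, 2, 16, 2

def matvec(m, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Down-projection into a low-rank latent space. Routing and expert compute
# happen in D_LATENT dimensions, so expert cost scales with D_LATENT rather
# than D_MODEL -- which is what lets more experts fit the same budget.
down = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_LATENT)]
router = [[random.gauss(0, 1) for _ in range(D_LATENT)] for _ in range(N_EXPERTS)]

token = [random.gauss(0, 1) for _ in range(D_MODEL)]
latent = matvec(down, token)     # D_MODEL -> D_LATENT compression
logits = matvec(router, latent)  # one routing score per expert
chosen = sorted(range(N_EXPERTS), key=lambda i: -logits[i])[:TOP_K]
print(f"routing in {D_LATENT}-d latent space, experts chosen: {chosen}")
```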

Multi-Token Prediction (MTP). Instead of predicting one token at a time, Super forecasts several future tokens simultaneously. This improves reasoning quality during training (the model must learn longer-range logical dependencies) and enables built-in speculative decoding at inference for up to 3x speedups on structured generation.

Native NVFP4 Pretraining. Super was trained natively in NVIDIA's 4-bit floating-point format, optimized for Blackwell GPUs. No post-training quantization loss — the model learns to be accurate within 4-bit constraints from the first gradient update.
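
To see how coarse 4 bits really is, here is a sketch of E2M1 quantization — the standard FP4 value grid — with NVFP4's per-block scaling simplified to a single shared scale for brevity:

```python
# Sketch of 4-bit float (E2M1) quantization: round each value to the nearest
# representable FP4 magnitude after applying a shared scale factor.
# (NVFP4 adds fine-grained per-block scales; one global scale is used here.)

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_fp4(values, scale):
    out = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out

weights = [0.013, -0.071, 0.042, 0.100]
scale = max(abs(w) for w in weights) / FP4_GRID[-1]  # map max |w| onto 6.0
print(quantize_fp4(weights, scale))
```

With only eight magnitudes per sign, rounding error is large — which is why training natively under the constraint, rather than quantizing afterward, matters.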

Training Pipeline

1. Pretraining: 25 trillion tokens (10T unique), spanning code, math, science, and general knowledge

2. Supervised fine-tuning: ~7 million samples from a 40M-sample corpus covering reasoning, coding, safety, and agent tasks

3. Multi-environment RL: 1.2 million+ environment rollouts across 21 configurations using NeMo Gym and NeMo RL

Benchmark Results

| Benchmark | Nemotron 3 Super | GPT-OSS-120B | Qwen3.5-122B |
|---|---|---|---|
| Artificial Analysis Intelligence Index | 36 | 33 | 42 |
| PinchBench (agentic, full suite) | 85.6% | | |
| GDPval-AA (agentic ELO) | 1027 | | |
| Terminal-Bench Hard | 29% | | |
| DeepResearch Bench | #1 (open models) | | |
| RULER 1M context | Outperforms both | | |

Throughput: Up to 2.2x higher than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B on 8K input / 16K output. Provider APIs show speeds up to 449–484 tokens/second.

Efficiency vs. intelligence trade-off: 11% higher throughput per NVIDIA B200 GPU than GPT-OSS-120B, with comparable or higher accuracy. Qwen3.5-122B scores 6 points higher on intelligence but at 40% lower throughput per GPU.

Available Formats

  • NVFP4 (optimized for Blackwell)
  • FP8
  • BF16
  • Base model (BF16, before post-training)

All on Hugging Face and via NVIDIA NIM.

Nemotron 3 Nano (30B-A3B): Edge-Ready Efficiency

Nano is the small model that punches above its weight. Released in December 2025, it's designed for local deployment on consumer and edge hardware.

Key Specs

  • 31.6B total, 3.2B active parameters (3.6B with embeddings)
  • Hybrid Mamba-Transformer MoE — same architectural family as Super
  • 1M token context window
  • Configurable reasoning depth at inference time

Benchmark Highlights

| Benchmark | Nemotron 3 Nano | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| MATH | 82.88% | 61.14% | |
| HumanEval (code) | 78.05% | | |
| RULER 64K (long context) | 87.5% | | |
| RULER 1M | 86.3% | | |
| Inference throughput (vs peers) | Baseline | 3.3x slower | 2.2x slower |

Nano dominates on math reasoning and long-context tasks while activating less than half the parameters of its predecessor (Nemotron 2 Nano). The 3.3x throughput advantage over Qwen3-30B-A3B comes from the Mamba-2 architecture's linear-time sequence processing.
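
The linear-vs-quadratic gap is easy to see with back-of-envelope arithmetic (constant factors and implementation details ignored — this is only the asymptotic shape, not a measured benchmark):

```python
# Relative sequence-mixing cost, ignoring constants: self-attention scales
# as n^2 over the context length, Mamba-style state-space layers as n.

def attention_vs_linear_ratio(n_tokens: int) -> int:
    attention_cost = n_tokens ** 2
    linear_cost = n_tokens
    return attention_cost // linear_cost  # simplifies to n_tokens

for n in (64_000, 1_000_000):
    print(f"{n:>9} tokens: attention/linear cost ratio = {attention_vs_linear_ratio(n):,}x")
```

The ratio grows with context length itself, which is why hybrid architectures pull further ahead precisely at the 1M-token scale these models target.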

Who Should Use Nano

  • Local LLM enthusiasts running on consumer GPUs (RTX 4090/5090 territory with FP8)
  • Edge deployments where latency and cost-per-token matter more than peak intelligence
  • High-volume engineering tasks where throughput is the governing constraint
  • Agent sub-tasks that don't need Super's full reasoning capacity — see guardrails for agent I/O for practical examples

Available Formats

  • FP8
  • BF16
  • Base model (BF16)

Qwen3-Nemotron-235B-A22B-GenRM: The Judge Model

This is the most specialized — and most misunderstood — model in the family. GenRM is not a chat model. It's a Generative Reward Model used to train the other Nemotron 3 models via RLHF.

What It Does

Given a conversation history, a user request, and two candidate responses, GenRM:

1. Scores each response individually on helpfulness (1–5 scale)

2. Produces a comparative ranking (1–6 scale, from "Response 1 is much better" to "Response 2 is much better")
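
A sketch of how the two signals might be folded into a single training preference — the scales (1–5 and 1–6) are from GenRM's spec above, but the normalization and equal weighting are hypothetical, not NVIDIA's published recipe:

```python
# Combine GenRM's two outputs into one preference signal for RLHF:
# per-response helpfulness scores (1-5 each) plus a comparative rank
# (1 = "Response 1 is much better" ... 6 = "Response 2 is much better").

def preference(score_1: int, score_2: int, rank: int) -> float:
    """Return a value in [-1, 1]; positive favors Response 1."""
    individual = (score_1 - score_2) / 4.0       # score gap, normalized
    comparative = (3.5 - rank) / 2.5             # rank 1 -> +1.0, rank 6 -> -1.0
    return 0.5 * individual + 0.5 * comparative  # hypothetical equal weighting

print(preference(score_1=5, score_2=2, rank=2))  # → 0.675: favors Response 1
```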

Architecture

  • Built on Qwen3-235B-A22B-Thinking-2507 as foundation
  • 235B total, 22B active parameters
  • Fine-tuned with GRPO algorithm on preference data from HelpSteer3 and Arena Human Preference datasets
  • 128K token context window
  • Requires 8x GPU tensor parallelism for serving

Why It Matters

Traditional reward models use a Bradley-Terry framework that tends to overfit — leading to "reward hacking" where models learn to game the scoring rather than genuinely improving. GenRM uses a generative approach: it reasons through its evaluation in natural language before scoring, which generalizes better across tasks and reduces reward hacking during RLHF.
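
For contrast, the scalar Bradley-Terry objective that GenRM moves away from reduces each comparison to a logistic function of a reward gap:

```python
import math

# Bradley-Terry: P(A preferred over B) = sigmoid(r_A - r_B), where r_* are
# scalar rewards. Nothing forces the scores to track real quality, so a
# policy can inflate r_A via superficial features -- classic reward hacking.

def bt_preference_prob(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

print(round(bt_preference_prob(2.0, 0.0), 3))  # → 0.881
```

A generative judge instead writes out its reasoning before scoring, so the score is anchored to an auditable argument rather than an unexplained scalar.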

NVIDIA used GenRM to train both Nemotron 3 Super and Nano. By releasing it openly, they're enabling the community to replicate and extend their training pipeline.

Who Should Use GenRM

  • Model trainers doing RLHF or DPO on their own models
  • Evaluation pipelines needing automated quality assessment
  • Research teams studying reward model behavior

This is not a model you'd deploy for chat or code generation. It's infrastructure for building better models.

How Nemotron 3 Compares to the Competition

vs. Qwen3.5 (122B-A10B)

Qwen3.5 scores higher on raw intelligence benchmarks (+6 points on Artificial Analysis Index) but at 40% lower throughput per GPU. If you're running many concurrent agents or processing high volumes, Super's efficiency advantage compounds fast. For teams choosing between Qwen versions for local deployment, see our Qwen 3.5 vs Qwen 2.5 benchmark comparison and should you upgrade to Qwen 3.5 guides.

vs. GPT-OSS-120B

Super matches or exceeds GPT-OSS-120B on accuracy while delivering 11% higher throughput per B200 GPU and 2.2x higher throughput on the standard 8K/16K test. Super also offers significantly more open training data and methodology disclosure.

vs. Llama (Meta)

Different design philosophy entirely. Meta's Llama models are dense Transformers. Nemotron 3 uses hybrid Mamba-Transformer MoE — trading peak single-query performance for dramatically better throughput at scale. For agentic workloads running many sessions concurrently, the MoE approach has clear advantages.

vs. Previous Nemotron (Super 1.0)

The new Super delivers 5x higher inference throughput than the previous Nemotron Super, with comparable or better accuracy. If you deployed the earlier version, upgrading is straightforward — same model APIs, better everything.

Getting Started

Try It Now (API)

The fastest path: use NVIDIA NIM for instant API access. Also available from Lightning AI and DeepInfra.

Self-Host

Super requires serious hardware — 8x B200 or equivalent for full precision. NVFP4 quantization brings it down to a single multi-GPU node on Blackwell.

Nano is more accessible. FP8 runs on a single H200 or (with aggressive quantization) on high-end consumer GPUs. The DGX Spark can also handle Nano with its 128 GB unified memory.

Both models work with vLLM out of the box:


# Nemotron 3 Super (FP8)
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8

# Nemotron 3 Nano (FP8)
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --trust-remote-code

Training Recipes

NVIDIA released everything: pretraining data, SFT datasets, RL environments, and full model recipes via the Nemotron Developer Repository.

The Bottom Line

Nemotron 3 is NVIDIA's statement that open models can be both competitive *and* efficient. Super isn't the smartest open model (Qwen3.5-122B edges it on benchmarks), but it's the most practical for agentic production workloads — and it's the most intelligent model ever released at this level of openness.

Choose Super if you're running multi-agent systems, need 1M context, or care about throughput-per-GPU economics.

Choose Nano if you need a fast, capable model for local/edge deployment that handles long context and reasoning well.

Use GenRM if you're training your own models and want NVIDIA's RLHF infrastructure.

The Ultra model (roughly 500B total, 50B active) is still coming. When it arrives, the Nemotron 3 family will cover everything from edge to frontier.


*Compare Nemotron 3 with 100+ other AI tools → toolhalla.ai/models*

*Run it locally? Check our Ollama setup guides and GPU hardware recommendations.*

FAQ

What is NVIDIA Nemotron and who is it for?

NVIDIA Nemotron is a family of LLMs optimized for enterprise use — customer service, instruction following, and running efficiently on NVIDIA hardware. Nemotron-3 8B is designed to run fast on a single GPU. It's for companies wanting NVIDIA-optimized models for their AI infrastructure.

How does Nemotron compare to Llama 3?

Nemotron-3 8B and Llama 3.1 8B are close in quality on standard benchmarks. Nemotron has an edge on instruction-following and is more thoroughly tested on NVIDIA's TensorRT-LLM stack. For general use, Llama 3.1 8B has better community support and more quantized variants available.

Can I run Nemotron locally?

Yes — Nemotron-3 8B is available on Hugging Face in GGUF format. It runs via Ollama, LM Studio, or llama.cpp. At Q4 quantization, it needs ~5GB VRAM — an RTX 3060 12GB runs it comfortably at 40-60 tok/s.

What is Nemotron best used for?

Nemotron excels at structured output, instruction following, and customer-facing applications. It's been specifically tuned for safety and helpfulness in business contexts. Not the best choice for creative writing or open-ended reasoning — Qwen 3 or Llama 3.3 are stronger there.

Is Nemotron available as a free API?

Yes — NVIDIA's API Catalog (build.nvidia.com) provides free API access to Nemotron models with generous rate limits for development. No NVIDIA GPU is required to use the API. This is a good way to test Nemotron before committing to local deployment.


