NVIDIA Nemotron 3: Complete Guide to Super, Nano, and GenRM (2026)

NVIDIA's Nemotron 3 family explained: Super (120B), Nano (30B), and GenRM reward model. Specs, benchmarks, architecture, and how they compare to Qwen, GPT-OSS, and Llama.

March 13, 2026 · 9 min read · 1,884 words

NVIDIA isn't just making the GPUs that run AI models — they're building the models themselves. The Nemotron 3 family is their most ambitious open-weight release yet: three models designed for different scales, from edge devices to multi-agent production systems.

Here's the full breakdown: what each model does, how they compare to competitors, and who should actually use them.

The Nemotron 3 Family at a Glance

| Model | Total Params | Active Params | Architecture | Context | Release | License |
|---|---|---|---|---|---|---|
| Nemotron 3 Super | 120B | 12B | Hybrid Mamba-Transformer MoE | 1M tokens | March 2026 | Open weights |
| Nemotron 3 Nano | 31.6B | 3.2B | Hybrid Mamba-Transformer MoE | 1M tokens | December 2025 | Open weights |
| Qwen3-Nemotron-235B-A22B-GenRM | 235B | 22B | Transformer (Qwen3) | 128K tokens | March 2026 | Apache 2.0 |
| Nemotron 3 Ultra | ~500B | ~50B | TBA | TBA | Coming soon | TBA |

All three released models are open weight. All use Mixture-of-Experts (MoE), meaning only a fraction of parameters activates per token — keeping inference fast despite large total parameter counts.

Nemotron 3 Super (120B-A12B): The Agentic Workhorse

Super is the headline model. It's built for one purpose: running as the brain of autonomous AI agents at scale. And the stakes are real — companies are already cutting knowledge workers and citing AI agents as the replacement. Models like Super are what make that possible.

Why It Matters

Multi-agent systems have a scaling problem. Agents re-send conversation history, tool outputs, and reasoning steps every turn — generating up to 15x more tokens than a standard chat interaction. This "context explosion" causes goal drift over long tasks, and using massive reasoning models for every sub-task makes things expensive and slow.
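
A toy calculation makes the blow-up concrete (the numbers here are illustrative, not NVIDIA's measurements): when an agent re-sends its full history every turn, input tokens grow roughly quadratically with turn count.

```python
# Toy model of "context explosion": each agent turn re-sends the full
# conversation history, so total input tokens grow roughly quadratically.

def total_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Sum of history re-sent at each turn: turn t re-sends t * tokens_per_turn."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

chat = total_input_tokens(turns=1, tokens_per_turn=500)    # single-shot chat
agent = total_input_tokens(turns=10, tokens_per_turn=500)  # 10-turn agent loop

print(agent // chat)  # → 55: a 10-turn loop processes ~55x the input tokens
```

Even a modest 10-turn loop multiplies input-token volume by an order of magnitude, which is why per-token inference cost dominates agent economics.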

Super solves this with efficiency-first architecture design.

Architecture Deep Dive

Hybrid Mamba-Transformer MoE. Super interleaves three layer types:

  • Mamba-2 layers handle most sequence processing with linear-time complexity. This is what makes the 1M-token context window practical, not theoretical.
  • Transformer attention layers are interleaved at key depths for precise associative recall — finding specific facts buried in long contexts.
  • MoE layers scale effective capacity without dense computation costs. Only a subset of experts activates per token.

Latent MoE. Before routing decisions, token embeddings are compressed into a low-rank latent space. Expert computation happens in this smaller dimension. The result: 4x as many expert specialists can be consulted for the same inference cost. This enables fine-grained specialization — distinct experts for Python syntax vs. SQL logic vs. conversational reasoning.
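
A minimal sketch of the latent-routing idea — the dimensions, random weights, and top-k rule below are illustrative stand-ins, not Nemotron's actual implementation:

```python
import random

random.seed(0)

D_MODEL, D_LATENT, N_EXPERTS, TOP_K = 8, 2, 16, 2

def matvec(m, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Down-projection into a low-rank latent space. Routing and expert compute
# happen in D_LATENT dimensions, so expert cost scales with D_LATENT rather
# than D_MODEL -- which is what lets more experts fit the same budget.
down = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_LATENT)]
router = [[random.gauss(0, 1) for _ in range(D_LATENT)] for _ in range(N_EXPERTS)]

token = [random.gauss(0, 1) for _ in range(D_MODEL)]
latent = matvec(down, token)     # D_MODEL -> D_LATENT compression
logits = matvec(router, latent)  # one routing score per expert
chosen = sorted(range(N_EXPERTS), key=lambda i: -logits[i])[:TOP_K]
print(f"routing in {D_LATENT}-d latent space, experts chosen: {chosen}")
```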

Multi-Token Prediction (MTP). Instead of predicting one token at a time, Super forecasts several future tokens simultaneously. This improves reasoning quality during training (the model must learn longer-range logical dependencies) and enables built-in speculative decoding at inference for up to 3x speedups on structured generation.

Native NVFP4 Pretraining. Super was trained natively in NVIDIA's 4-bit floating-point format, optimized for Blackwell GPUs. No post-training quantization loss — the model learns to be accurate within 4-bit constraints from the first gradient update.
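
To see how coarse 4 bits really is, here is a sketch of E2M1 quantization — the standard FP4 value grid — with NVFP4's per-block scaling simplified to a single shared scale for brevity:

```python
# Sketch of 4-bit float (E2M1) quantization: round each value to the nearest
# representable FP4 magnitude after applying a shared scale factor.
# (NVFP4 adds fine-grained per-block scales; one global scale is used here.)

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_fp4(values, scale):
    out = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out

weights = [0.013, -0.071, 0.042, 0.100]
scale = max(abs(w) for w in weights) / FP4_GRID[-1]  # map max |w| onto 6.0
print(quantize_fp4(weights, scale))
```

With only eight magnitudes per sign, rounding error is large — which is why training natively under the constraint, rather than quantizing afterward, matters.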

Training Pipeline

1. Pretraining: 25 trillion tokens (10T unique), spanning code, math, science, and general knowledge

2. Supervised fine-tuning: ~7 million samples from a 40M-sample corpus covering reasoning, coding, safety, and agent tasks

3. Multi-environment RL: 1.2 million+ environment rollouts across 21 configurations using NeMo Gym and NeMo RL

Benchmark Results

| Benchmark | Nemotron 3 Super | GPT-OSS-120B | Qwen3.5-122B |
|---|---|---|---|
| Artificial Analysis Intelligence Index | 36 | 33 | 42 |
| PinchBench (agentic, full suite) | 85.6% | | |
| GDPval-AA (agentic ELO) | 1027 | | |
| Terminal-Bench Hard | 29% | | |
| DeepResearch Bench | #1 (open models) | | |
| RULER 1M context | Outperforms both | | |

Throughput: Up to 2.2x higher than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B on 8K input / 16K output. Provider APIs show speeds up to 449–484 tokens/second.

Efficiency vs. intelligence trade-off: 11% higher throughput per NVIDIA B200 GPU than GPT-OSS-120B, with comparable or higher accuracy. Qwen3.5-122B scores 6 points higher on intelligence but at 40% lower throughput per GPU.

Available Formats

  • NVFP4 (optimized for Blackwell)
  • FP8
  • BF16
  • Base model (BF16, before post-training)

All on Hugging Face and via NVIDIA NIM.

Nemotron 3 Nano (30B-A3B): Edge-Ready Efficiency

Nano is the small model that punches above its weight. Released in December 2025, it's designed for local deployment on consumer and edge hardware.

Key Specs

  • 31.6B total, 3.2B active parameters (3.6B with embeddings)
  • Hybrid Mamba-Transformer MoE — same architectural family as Super
  • 1M token context window
  • Configurable reasoning depth at inference time

Benchmark Highlights

| Benchmark | Nemotron 3 Nano | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| MATH | 82.88% | 61.14% | |
| HumanEval (code) | 78.05% | | |
| RULER 64K (long context) | 87.5% | | |
| RULER 1M | 86.3% | | |
| Inference throughput (vs peers) | Baseline | 3.3x slower | 2.2x slower |

Nano dominates on math reasoning and long-context tasks while activating less than half the parameters of its predecessor (Nemotron 2 Nano). The 3.3x throughput advantage over Qwen3-30B-A3B comes from the Mamba-2 architecture's linear-time sequence processing.
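
The linear-vs-quadratic gap is easy to see with back-of-envelope arithmetic (constant factors and implementation details ignored — this is only the asymptotic shape, not a measured benchmark):

```python
# Relative sequence-mixing cost, ignoring constants: self-attention scales
# as n^2 over the context length, Mamba-style state-space layers as n.

def attention_vs_linear_ratio(n_tokens: int) -> int:
    attention_cost = n_tokens ** 2
    linear_cost = n_tokens
    return attention_cost // linear_cost  # simplifies to n_tokens

for n in (64_000, 1_000_000):
    print(f"{n:>9} tokens: attention/linear cost ratio = {attention_vs_linear_ratio(n):,}x")
```

The ratio grows with context length itself, which is why hybrid architectures pull further ahead precisely at the 1M-token scale these models target.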

Who Should Use Nano

  • Local LLM enthusiasts running on consumer GPUs (RTX 4090/5090 territory with FP8)
  • Edge deployments where latency and cost-per-token matter more than peak intelligence
  • High-volume engineering tasks where throughput is the governing constraint
  • Agent sub-tasks that don't need Super's full reasoning capacity — see guardrails for agent I/O for practical examples

Available Formats

  • FP8
  • BF16
  • Base model (BF16)

Qwen3-Nemotron-235B-A22B-GenRM: The Judge Model

This is the most specialized — and most misunderstood — model in the family. GenRM is not a chat model. It's a Generative Reward Model used to train the other Nemotron 3 models via RLHF.

What It Does

Given a conversation history, a user request, and two candidate responses, GenRM:

1. Scores each response individually on helpfulness (1–5 scale)

2. Produces a comparative ranking (1–6 scale, from "Response 1 is much better" to "Response 2 is much better")
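
A sketch of how the two signals might be folded into a single training preference — the scales (1–5 and 1–6) are from GenRM's spec above, but the normalization and equal weighting are hypothetical, not NVIDIA's published recipe:

```python
# Combine GenRM's two outputs into one preference signal for RLHF:
# per-response helpfulness scores (1-5 each) plus a comparative rank
# (1 = "Response 1 is much better" ... 6 = "Response 2 is much better").

def preference(score_1: int, score_2: int, rank: int) -> float:
    """Return a value in [-1, 1]; positive favors Response 1."""
    individual = (score_1 - score_2) / 4.0       # score gap, normalized
    comparative = (3.5 - rank) / 2.5             # rank 1 -> +1.0, rank 6 -> -1.0
    return 0.5 * individual + 0.5 * comparative  # hypothetical equal weighting

print(preference(score_1=5, score_2=2, rank=2))  # → 0.675: favors Response 1
```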

Architecture

  • Built on Qwen3-235B-A22B-Thinking-2507 as foundation
  • 235B total, 22B active parameters
  • Fine-tuned with GRPO algorithm on preference data from HelpSteer3 and Arena Human Preference datasets
  • 128K token context window
  • Requires 8x GPU tensor parallelism for serving

Why It Matters

Traditional reward models use a Bradley-Terry framework that tends to overfit — leading to "reward hacking" where models learn to game the scoring rather than genuinely improving. GenRM uses a generative approach: it reasons through its evaluation in natural language before scoring, which generalizes better across tasks and reduces reward hacking during RLHF.
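
For contrast, the scalar Bradley-Terry objective that GenRM moves away from reduces each comparison to a logistic function of a reward gap:

```python
import math

# Bradley-Terry: P(A preferred over B) = sigmoid(r_A - r_B), where r_* are
# scalar rewards. Nothing forces the scores to track real quality, so a
# policy can inflate r_A via superficial features -- classic reward hacking.

def bt_preference_prob(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

print(round(bt_preference_prob(2.0, 0.0), 3))  # → 0.881
```

A generative judge instead writes out its reasoning before scoring, so the score is anchored to an auditable argument rather than an unexplained scalar.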

NVIDIA used GenRM to train both Nemotron 3 Super and Nano. By releasing it openly, they're enabling the community to replicate and extend their training pipeline.

Who Should Use GenRM

  • Model trainers doing RLHF or DPO on their own models
  • Evaluation pipelines needing automated quality assessment
  • Research teams studying reward model behavior

This is not a model you'd deploy for chat or code generation. It's infrastructure for building better models.

How Nemotron 3 Compares to the Competition

vs. Qwen3.5 (122B-A10B)

Qwen3.5 scores higher on raw intelligence benchmarks (+6 points on Artificial Analysis Index) but at 40% lower throughput per GPU. If you're running many concurrent agents or processing high volumes, Super's efficiency advantage compounds fast. For teams choosing between Qwen versions for local deployment, see our Qwen 3.5 vs Qwen 2.5 benchmark comparison and should you upgrade to Qwen 3.5 guides.

vs. GPT-OSS-120B

Super matches or exceeds GPT-OSS-120B on accuracy while delivering 11% higher throughput per B200 GPU and 2.2x higher throughput on the standard 8K/16K test. Super also offers significantly more open training data and methodology disclosure.

vs. Llama (Meta)

Different design philosophy entirely. Meta's Llama models are dense Transformers. Nemotron 3 uses hybrid Mamba-Transformer MoE — trading peak single-query performance for dramatically better throughput at scale. For agentic workloads running many sessions concurrently, the MoE approach has clear advantages.

vs. Previous Nemotron (Super 1.0)

The new Super delivers 5x higher inference throughput than the previous Nemotron Super, with comparable or better accuracy. If you deployed the earlier version, upgrading is straightforward — same model APIs, better everything.

Getting Started

Try It Now (API)

The fastest path: use NVIDIA NIM for instant API access. Also available from Lightning AI and DeepInfra.

Self-Host

Super requires serious hardware — 8x B200 or equivalent for full precision. NVFP4 quantization brings it down to a single multi-GPU node on Blackwell.

Nano is more accessible. FP8 runs on a single H200 or (with aggressive quantization) on high-end consumer GPUs. The DGX Spark can also handle Nano with its 128 GB unified memory.

Both models work with vLLM out of the box:


# Nemotron 3 Super (FP8)
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8

# Nemotron 3 Nano (FP8)
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --trust-remote-code

Training Recipes

NVIDIA released everything: pretraining data, SFT datasets, RL environments, and full model recipes via the Nemotron Developer Repository.

The Bottom Line

Nemotron 3 is NVIDIA's statement that open models can be both competitive *and* efficient. Super isn't the smartest open model (Qwen3.5-122B edges it on benchmarks), but it's the most practical for agentic production workloads — and it's the most intelligent model ever released at this level of openness.

Choose Super if you're running multi-agent systems, need 1M context, or care about throughput-per-GPU economics.

Choose Nano if you need a fast, capable model for local/edge deployment that handles long context and reasoning well.

Use GenRM if you're training your own models and want NVIDIA's RLHF infrastructure.

The Ultra model (roughly 500B total, 50B active) is still coming. When it arrives, the Nemotron 3 family will cover everything from edge to frontier.


*Compare Nemotron 3 with 100+ other AI tools → toolhalla.ai/models*

*Run it locally? Check our Ollama setup guides and GPU hardware recommendations.*

FAQ

What is NVIDIA Nemotron and who is it for?

NVIDIA Nemotron is a family of LLMs optimized for enterprise use — customer service, instruction following, and running efficiently on NVIDIA hardware. Nemotron-3 8B is designed to run fast on a single GPU. It's for companies wanting NVIDIA-optimized models for their AI infrastructure.

How does Nemotron compare to Llama 3?

Nemotron-3 8B and Llama 3.1 8B are close in quality on standard benchmarks. Nemotron has an edge on instruction-following and is more thoroughly tested on NVIDIA's TensorRT-LLM stack. For general use, Llama 3.1 8B has better community support and more quantized variants available.

Can I run Nemotron locally?

Yes — Nemotron-3 8B is available on Hugging Face in GGUF format. It runs via Ollama, LM Studio, or llama.cpp. At Q4 quantization, it needs ~5GB VRAM — an RTX 3060 12GB runs it comfortably at 40-60 tok/s.

What is Nemotron best used for?

Nemotron excels at structured output, instruction following, and customer-facing applications. It's been specifically tuned for safety and helpfulness in business contexts. Not the best choice for creative writing or open-ended reasoning — Qwen 3 or Llama 3.3 are stronger there.

Is Nemotron available as a free API?

Yes — NVIDIA's API Catalog (build.nvidia.com) provides free API access to Nemotron models with generous rate limits for development. No NVIDIA GPU is required to use the API. This is a good way to test Nemotron before committing to local deployment.


