NVIDIA Nemotron 3: Complete Guide to Super, Nano, and GenRM (2026)
NVIDIA's Nemotron 3 family explained: Super (120B), Nano (30B), and GenRM reward model. Specs, benchmarks, architecture, and how they compare to Qwen, GPT-OSS, and Llama.
NVIDIA isn't just making the GPUs that run AI models — they're building the models themselves. The Nemotron 3 family is their most ambitious open-weight release yet: three models designed for different scales, from edge devices to multi-agent production systems.
Here's the full breakdown: what each model does, how they compare to competitors, and who should actually use them.
The Nemotron 3 Family at a Glance
| Model | Total Params | Active Params | Architecture | Context | Release | License |
|---|---|---|---|---|---|---|
| Nemotron 3 Super | 120B | 12B | Hybrid Mamba-Transformer MoE | 1M tokens | March 2026 | Open weights |
| Nemotron 3 Nano | 31.6B | 3.2B | Hybrid Mamba-Transformer MoE | 1M tokens | December 2025 | Open weights |
| Qwen3-Nemotron-235B-A22B-GenRM | 235B | 22B | Transformer (Qwen3) | 128K tokens | March 2026 | Apache 2.0 |
| Nemotron 3 Ultra | ~500B | ~50B | TBA | TBA | Coming soon | TBA |
All three released models are open weight. All use Mixture-of-Experts (MoE), meaning only a fraction of parameters activates per token — keeping inference fast despite large total parameter counts.
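To make that per-token activation concrete, here's a minimal top-k MoE routing sketch in plain NumPy. The gate shape, expert form, and k=2 are illustrative assumptions, not Nemotron's actual configuration:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route a token vector x to its k highest-scoring experts and mix
    their outputs with renormalized softmax gate weights. Only k experts
    run per token, which is why MoE inference stays cheap even when the
    total parameter count is large."""
    logits = x @ gate_w                        # (n_experts,) routing scores
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over selected experts
    return sum(w * experts[i](x) for i, w in zip(top, weights))
```

Because the gate weights are renormalized to sum to 1, identity experts return the input unchanged — a quick sanity check that the mixing is a convex combination.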
Nemotron 3 Super (120B-A12B): The Agentic Workhorse
Super is the headline model. It's built for one purpose: running as the brain of autonomous AI agents at scale. And the stakes are real — companies are already cutting knowledge workers and citing AI agents as the replacement. Models like Super are what makes that possible.
Why It Matters
Multi-agent systems have a scaling problem. Agents re-send conversation history, tool outputs, and reasoning steps every turn — generating up to 15x more tokens than a standard chat interaction. This "context explosion" causes goal drift over long tasks, and using massive reasoning models for every sub-task makes things expensive and slow.
Super solves this with efficiency-first architecture design.
Architecture Deep Dive
Hybrid Mamba-Transformer MoE. Super interleaves three layer types:
- Mamba-2 layers handle most sequence processing with linear-time complexity. This is what makes the 1M-token context window practical, not theoretical.
- Transformer attention layers are interleaved at key depths for precise associative recall — finding specific facts buried in long contexts.
- MoE layers scale effective capacity without dense computation costs. Only a subset of experts activates per token.
Latent MoE. Before routing decisions, token embeddings are compressed into a low-rank latent space. Expert computation happens in this smaller dimension. The result: 4x as many expert specialists can be consulted for the same inference cost. This enables fine-grained specialization — distinct experts for Python syntax vs. SQL logic vs. conversational reasoning.
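The 4x figure falls out of simple arithmetic if expert cost scales with the square of the working dimension. A back-of-envelope sketch, assuming square d x d expert projections and a 2x latent compression — the real layer shapes and compression ratio are not published:

```python
def expert_flops(dim: int) -> int:
    """Multiply-accumulate cost of one dim x dim expert projection
    applied to a single token."""
    return 2 * dim * dim

d_model = 4096            # hypothetical hidden size
d_latent = d_model // 2   # compress 2x before routing

# Halving the dimension quarters the per-expert cost, so four
# latent-space experts fit in the budget of one full-dimension expert.
ratio = expert_flops(d_model) // expert_flops(d_latent)
print(ratio)  # → 4
```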
Multi-Token Prediction (MTP). Instead of predicting one token at a time, Super forecasts several future tokens simultaneously. This improves reasoning quality during training (the model must learn longer-range logical dependencies) and enables built-in speculative decoding at inference for up to 3x speedups on structured generation.
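A sketch of what a multi-token prediction objective looks like: each prediction head gets its own cross-entropy against a further-ahead target, and the per-head losses are averaged. The head count and uniform weighting here are assumptions, not Nemotron's published recipe:

```python
import numpy as np

def mtp_loss(logits_per_head, targets):
    """Average cross-entropy across prediction heads, where head j is
    trained to predict token t+1+j rather than only the next token."""
    losses = []
    for logits, target in zip(logits_per_head, targets):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                   # softmax over the vocabulary
        losses.append(-np.log(probs[target]))  # cross-entropy for this head
    return float(np.mean(losses))
```

At inference, the extra heads double as draft-token proposers for speculative decoding: the model suggests several tokens at once and verifies them in a single forward pass.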
Native NVFP4 Pretraining. Super was trained natively in NVIDIA's 4-bit floating-point format, optimized for Blackwell GPUs. No post-training quantization loss — the model learns to be accurate within 4-bit constraints from the first gradient update.
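For intuition about what "4-bit constraints" means: FP4 E2M1 can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, and NVFP4 pairs those values with per-block scale factors. The rounding sketch below is a simplified illustration, not NVIDIA's actual kernel:

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(values, grid=FP4_GRID):
    """Fake-quantize one block: pick a scale so the block's max magnitude
    lands on the grid's top value, round each element to the nearest
    representable magnitude, then scale back up."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / grid[-1]
    out = []
    for v in values:
        mag = min(abs(v) / scale, grid[-1])
        nearest = min(grid, key=lambda g: abs(g - mag))
        out.append(nearest * scale if v >= 0 else -nearest * scale)
    return out
```

Training natively in this format means weights and activations live on grids like this from the first gradient update, so there is no separate lossy conversion step after training.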
Training Pipeline
1. Pretraining: 25 trillion tokens (10T unique), spanning code, math, science, and general knowledge
2. Supervised fine-tuning: ~7 million samples from a 40M-sample corpus covering reasoning, coding, safety, and agent tasks
3. Multi-environment RL: 1.2 million+ environment rollouts across 21 configurations using NeMo Gym and NeMo RL
Benchmark Results
| Benchmark | Nemotron 3 Super | GPT-OSS-120B | Qwen3.5-122B |
|---|---|---|---|
| Artificial Analysis Intelligence Index | 36 | 33 | 42 |
| PinchBench (agentic, full suite) | 85.6% | — | — |
| GDPval-AA (agentic ELO) | 1027 | — | — |
| Terminal-Bench Hard | 29% | — | — |
| DeepResearch Bench | #1 (open models) | — | — |
| RULER 1M context | Outperforms both | — | — |
Throughput: Up to 2.2x higher than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B at 8K input / 16K output. Provider APIs report speeds of 449–484 tokens/second.
Efficiency vs. intelligence trade-off: 11% higher throughput per NVIDIA B200 GPU than GPT-OSS-120B, with comparable or higher accuracy. Qwen3.5-122B scores 6 points higher on intelligence but at 40% lower throughput per GPU.
Available Formats
- NVFP4 (optimized for Blackwell)
- FP8
- BF16
- Base model (BF16, before post-training)
All on Hugging Face and via NVIDIA NIM.
Nemotron 3 Nano (30B-A3B): Edge-Ready Efficiency
Nano is the small model that punches above its weight. Released in December 2025, it's designed for local deployment on consumer and edge hardware.
Key Specs
- 31.6B total, 3.2B active parameters (3.6B with embeddings)
- Hybrid Mamba-Transformer MoE — same architectural family as Super
- 1M token context window
- Configurable reasoning depth at inference time
Benchmark Highlights
| Benchmark | Nemotron 3 Nano | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| MATH | 82.88% | 61.14% | — |
| HumanEval (code) | 78.05% | — | — |
| RULER 64K (long context) | 87.5% | — | — |
| RULER 1M | 86.3% | — | — |
| Inference throughput (vs peers) | Baseline | 3.3x slower | 2.2x slower |
Nano dominates on math reasoning and long-context tasks while activating less than half the parameters of its predecessor (Nemotron 2 Nano). The 3.3x throughput advantage over Qwen3-30B-A3B comes from the Mamba-2 architecture's linear-time sequence processing.
Who Should Use Nano
- Local LLM enthusiasts running on consumer GPUs (RTX 4090/5090 territory with FP8)
- Edge deployments where latency and cost-per-token matter more than peak intelligence
- High-volume engineering tasks where throughput is the governing constraint
- Agent sub-tasks that don't need Super's full reasoning capacity — see guardrails for agent I/O for practical examples
Available Formats
- FP8
- BF16
- Base model (BF16)
Qwen3-Nemotron-235B-A22B-GenRM: The Judge Model
This is the most specialized — and most misunderstood — model in the family. GenRM is not a chat model. It's a Generative Reward Model used to train the other Nemotron 3 models via RLHF.
What It Does
Given a conversation history, a user request, and two candidate responses, GenRM:
1. Scores each response individually on helpfulness (1–5 scale)
2. Produces a comparative ranking (1–6 scale, from "Response 1 is much better" to "Response 2 is much better")
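In a training pipeline you would collapse that comparative ranking into a preference label. A minimal sketch — the intermediate label wording is an assumption inferred from the scale's stated endpoints:

```python
# Assumed label text for ranks 2-5; only the two "much better" extremes
# are documented on the 1-6 scale described above.
RANKING_LABELS = {
    1: "Response 1 is much better",
    2: "Response 1 is better",
    3: "Response 1 is slightly better",
    4: "Response 2 is slightly better",
    5: "Response 2 is better",
    6: "Response 2 is much better",
}

def preferred(ranking: int) -> int:
    """Map a 1-6 comparative ranking to the index of the winning response."""
    if ranking not in RANKING_LABELS:
        raise ValueError(f"ranking must be 1-6, got {ranking}")
    return 1 if ranking <= 3 else 2
```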
Architecture
- Built on Qwen3-235B-A22B-Thinking-2507 as foundation
- 235B total, 22B active parameters
- Fine-tuned with GRPO algorithm on preference data from HelpSteer3 and Arena Human Preference datasets
- 128K token context window
- Requires 8x GPU tensor parallelism for serving
Why It Matters
Traditional reward models use a Bradley-Terry framework that tends to overfit — leading to "reward hacking" where models learn to game the scoring rather than genuinely improving. GenRM uses a generative approach: it reasons through its evaluation in natural language before scoring, which generalizes better across tasks and reduces reward hacking during RLHF.
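For context, the Bradley-Terry objective that GenRM moves away from is just a logistic loss on the score margin. A minimal sketch:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)). The loss
    depends only on the margin between the two scalar scores, so a
    policy can minimize it by inflating that margin on spurious
    features -- the overfitting path that leads to reward hacking."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A generative judge, by contrast, must produce a written rationale before scoring, which is harder to satisfy with superficial cues alone.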
NVIDIA used GenRM to train both Nemotron 3 Super and Nano. By releasing it openly, they're enabling the community to replicate and extend their training pipeline.
Who Should Use GenRM
- Model trainers doing RLHF or DPO on their own models
- Evaluation pipelines needing automated quality assessment
- Research teams studying reward model behavior
This is not a model you'd deploy for chat or code generation. It's infrastructure for building better models.
How Nemotron 3 Compares to the Competition
vs. Qwen3.5 (122B-A10B)
Qwen3.5 scores higher on raw intelligence benchmarks (+6 points on Artificial Analysis Index) but at 40% lower throughput per GPU. If you're running many concurrent agents or processing high volumes, Super's efficiency advantage compounds fast. For teams choosing between Qwen versions for local deployment, see our Qwen 3.5 vs Qwen 2.5 benchmark comparison and should you upgrade to Qwen 3.5 guides.
vs. GPT-OSS-120B
Super matches or exceeds GPT-OSS-120B on accuracy while delivering 11% higher throughput per B200 GPU and 2.2x higher throughput on the standard 8K/16K test. Super also offers significantly more open training data and methodology disclosure.
vs. Llama (Meta)
Different design philosophy entirely. Meta's Llama models are dense Transformers. Nemotron 3 uses hybrid Mamba-Transformer MoE — trading peak single-query performance for dramatically better throughput at scale. For agentic workloads running many sessions concurrently, the MoE approach has clear advantages.
vs. Previous Nemotron (Super 1.0)
The new Super delivers 5x higher inference throughput than the previous Nemotron Super, with comparable or better accuracy. If you deployed the earlier version, upgrading is straightforward — same model APIs, better everything.
Getting Started
Try It Now (API)
The fastest path: use NVIDIA NIM for instant API access. Also available from Lightning AI and DeepInfra.
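NIM endpoints speak the OpenAI chat-completions wire format. A stdlib-only sketch of building such a request — the base URL and model ID below are assumptions to verify against the NIM catalog before use:

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, prompt):
    """Construct (but don't send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request(
    "https://integrate.api.nvidia.com/v1",  # assumed NIM base URL
    "YOUR_API_KEY",
    "nvidia/nemotron-3-super-120b-a12b",    # hypothetical model ID
    "Plan a three-step research task.",
)
# urllib.request.urlopen(req) would send it once you have a key.
```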
Self-Host
Super requires serious hardware — 8x B200 or equivalent for full precision. NVFP4 quantization brings it down to a single multi-GPU node on Blackwell.
Nano is more accessible. FP8 runs on a single H200 or (with aggressive quantization) on high-end consumer GPUs. The DGX Spark can also handle Nano with its 128 GB unified memory.
Both models work with vLLM out of the box:
```bash
# Nemotron 3 Super (FP8)
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8
```

```bash
# Nemotron 3 Nano (FP8)
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --trust-remote-code
```
Training Recipes
NVIDIA released everything: pretraining data, SFT datasets, RL environments, and full model recipes via the Nemotron Developer Repository.
The Bottom Line
Nemotron 3 is NVIDIA's statement that open models can be both competitive *and* efficient. Super isn't the smartest open model (Qwen3.5-122B edges it on benchmarks), but it's the most practical for agentic production workloads — and it's the most intelligent model ever released at this level of openness.
Choose Super if you're running multi-agent systems, need 1M context, or care about throughput-per-GPU economics.
Choose Nano if you need a fast, capable model for local/edge deployment that handles long context and reasoning well.
Use GenRM if you're training your own models and want NVIDIA's RLHF infrastructure.
The Ultra model (roughly 500B total, 50B active) is still coming. When it arrives, the Nemotron 3 family will cover everything from edge to frontier.
*Compare Nemotron 3 with 100+ other AI tools → toolhalla.ai/models*
*Run it locally? Check our Ollama setup guides and GPU hardware recommendations.*
FAQ
What is NVIDIA Nemotron and who is it for?
NVIDIA Nemotron is a family of LLMs optimized for enterprise use — customer service, instruction following, and running efficiently on NVIDIA hardware. Nemotron-3 8B is designed to run fast on a single GPU. It's for companies wanting NVIDIA-optimized models for their AI infrastructure.
How does Nemotron compare to Llama 3?
Nemotron-3 8B and Llama 3.1 8B are close in quality on standard benchmarks. Nemotron has an edge on instruction-following and is more thoroughly tested on NVIDIA's TensorRT-LLM stack. For general use, Llama 3.1 8B has better community support and more quantized variants available.
Can I run Nemotron locally?
Yes — Nemotron-3 8B is available on Hugging Face in GGUF format. It runs via Ollama, LM Studio, or llama.cpp. At Q4 quantization, it needs ~5GB VRAM — an RTX 3060 12GB runs it comfortably at 40-60 tok/s.
What is Nemotron best used for?
Nemotron excels at structured output, instruction following, and customer-facing applications. It's been specifically tuned for safety and helpfulness in business contexts. Not the best choice for creative writing or open-ended reasoning — Qwen 3 or Llama 3.3 are stronger there.
Is Nemotron available as a free API?
Yes — NVIDIA's API Catalog (build.nvidia.com) provides free API access to Nemotron models with generous rate limits for development. No NVIDIA GPU is required to use the API. This is a good way to test Nemotron before committing to local deployment.
Recommended Hardware
- NVIDIA GeForce RTX 5090 — A high-performance GPU that can handle the computational demands of large AI models like Nemotron 3 Super.
- HP Z8 G5 Workstation — A powerful workstation designed for demanding tasks, suitable for running multi-agent systems and large-scale AI models.
- Samsung NVMe 980 Pro 2TB M.2 SSD — Essential for fast data access and storage, crucial for efficiently handling the large datasets and parameters of AI models like Nemotron 3.