NVIDIA Nemotron 3 Ultra for Long-Running Agents

NVIDIA released Nemotron 3 Ultra, a 550B/55B-active open MoE model aimed at long-running agents. Here is what the model cards support, what stays vendor-reported, and who should watch it.

June 5, 2026·8 min read·1,545 words

On June 4, 2026, NVIDIA released Nemotron 3 Ultra, an open-weight Mixture-of-Experts model the company positions for long-running agents — the kind of multi-step systems that plan, call tools, read large contexts, and run for many turns before they finish a task. The useful signal here is not another leaderboard entry. It is that NVIDIA is framing agent performance as a systems problem, where model capability, context length, throughput, and deployment hardware have to be reasoned about together.

This article separates what NVIDIA's launch materials actually say from what they do not. Toolhalla has not tested Nemotron 3 Ultra hands-on, and NVIDIA's own posts are not independent benchmarks. Every throughput, cost, and accuracy figure below is a number NVIDIA reports, not a result we have reproduced.

Primary sources: the NVIDIA Technical Blog launch, the NIM model card on build.nvidia.com, the NIM API reference, and the Hugging Face cards for the BF16 and NVFP4 checkpoints.

What NVIDIA released

According to NVIDIA, Nemotron 3 Ultra is "a 550B-parameter Mixture-of-Experts model with 55B active parameters," released as a fully open package that NVIDIA describes as "fully open—including weights, data, and recipes." NVIDIA says the model is "built to help long-running agents complete tasks faster while lowering cost," and frames it around "frontier reasoning and orchestration in agentic systems."

The release is available through three channels: Hugging Face (BF16 and NVFP4 checkpoints), the NVIDIA NIM microservice, and build.nvidia.com. NVIDIA states the model releases are "moving to OpenMDW-1.1, the Linux Foundation's permissive license purpose-built for open AI model distributions." The Hugging Face cards name the license as the "OpenMDW License Agreement, version 1.1 (OpenMDW-1.1)" and describe it as covering commercial and non-commercial use. This is not Apache 2.0. If license terms matter for your deployment, read the OpenMDW-1.1 text directly rather than assuming permissive-by-default behavior.

Why long-running agents change the model requirement

A chatbot can be good at single-turn answers and still be a poor agent. Long-running agents accumulate state across many steps: they hold large working contexts, call external tools, and pay a latency and cost tax on every turn. When an agent runs for dozens or hundreds of steps, two things dominate the experience — how much context the model can actually use, and how fast and cheaply each step runs.

That is the gap NVIDIA is aiming at. Instead of pitching a single benchmark win, the launch frames the value as throughput and cost per completed task. For builders, this reframing is the genuinely useful part: when you evaluate a model for an agent, the question is not only "is it smart enough" but "what does a full task cost and how long does it take end to end." Those are the metrics that decide whether an agent is viable in production.
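To make "cost per completed task" concrete, here is a minimal Python sketch of the arithmetic. Every number in it (tokens per step, step count, per-token prices, latency) is a hypothetical placeholder rather than a Nemotron 3 Ultra figure, and `StepProfile`, `task_cost_usd`, and `task_latency_s` are illustrative names, not part of any NVIDIA API.

```python
from dataclasses import dataclass


@dataclass
class StepProfile:
    """Average per-step measurements for one agent workload (all hypothetical)."""
    input_tokens: int    # prompt tokens sent per step
    output_tokens: int   # tokens generated per step
    latency_s: float     # wall-clock seconds per step, including tool calls


def task_cost_usd(profile: StepProfile, steps: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Total spend for one completed task made of `steps` model calls."""
    per_step = (profile.input_tokens * usd_per_m_input
                + profile.output_tokens * usd_per_m_output) / 1_000_000
    return per_step * steps


def task_latency_s(profile: StepProfile, steps: int) -> float:
    """End-to-end wall-clock time, assuming the steps run sequentially."""
    return profile.latency_s * steps


# Placeholder workload: 50 steps, 20k input and 500 output tokens per step.
profile = StepProfile(input_tokens=20_000, output_tokens=500, latency_s=4.0)
cost = task_cost_usd(profile, steps=50, usd_per_m_input=1.0, usd_per_m_output=4.0)
print(f"cost per task: ${cost:.2f}, latency: {task_latency_s(profile, 50):.0f}s")
```

Real agents usually grow their context as they go, so a flat per-step average understates late-step cost. Swap in measured per-step values once you have traces from your own workload.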

The specs that matter: 550B total, 55B active, 1M context

The headline numbers, as stated by NVIDIA across the model cards:

  • 550B total parameters, 55B active. Only a fraction of the network runs per token because of the Mixture-of-Experts design, which is how a frontier-scale model can stay cheaper to run than a dense model of similar size.
  • Up to 1M tokens of context. The cards list a maximum context length of up to 1M tokens, which is the spec most directly relevant to agents that read large codebases, document sets, or long tool-call histories.
  • LatentMoE architecture. NVIDIA describes it as "Latent Mixture-of-Experts (LatentMoE) architecture, utilizing interleaved Mamba-2 and MoE layers, along with select Attention layers," plus "Multi-Token Prediction (MTP) layers for faster text generation and improved quality."
  • NVFP4 precision. NVIDIA says it used an "NVFP4 recipe to maximize compute efficiency" in pre-training, and ships a separate NVFP4 checkpoint for lower-footprint deployment.
  • Multilingual support. The NVFP4 card lists English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, and Chinese.

The hybrid Mamba-2-plus-attention design and MTP layers are NVIDIA's stated mechanisms for the speed claims. Treat them as architectural facts from the model card, not as proof of any particular real-world latency on your workload.
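For readers new to Mixture-of-Experts, the sketch below shows the general idea behind "total versus active" parameters using generic top-k routing. It is a toy written for this article, not NVIDIA's LatentMoE implementation: the expert count, the scalar "experts", and the random router scores are all assumptions. The 1-of-10 ratio only mirrors the headline 55B-of-550B figure.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by routing weight."""
    routes = top_k_route(router_logits, k)
    output = sum(w * experts[i](x) for i, w in routes)
    return output, [i for i, _ in routes]


# Hypothetical toy setup: 10 scalar "experts", 1 active per token.
NUM_EXPERTS, K = 10, 1
experts = [lambda x, s=s: x * s for s in range(1, NUM_EXPERTS + 1)]
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

y, used = moe_forward(2.0, experts, logits, K)
print(f"experts used: {used} of {NUM_EXPERTS} ({K / NUM_EXPERTS:.0%} active)")
```

The point of the toy is that unselected experts are never called, so per-token compute scales with the active subset while the memory footprint still scales with the total.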

What the NVIDIA claims do — and do not — prove

NVIDIA reports several strong numbers. We quote them as NVIDIA's claims, because that is what they are:

  • "5x higher throughput compared to other open models in its class."
  • "30% cost savings to complete SWE Bench verified benchmark."
  • "5x faster inference while delivering leading accuracy."

The model cards also list benchmark scores including SWE-Bench Verified 71.9, MMLU-Pro 86.8, LiveCodeBench (v6) 89.0, IOI 2025 570.0, RULER at 1M context 94.7, and a multilingual MMLU-ProX average of 83.0. The launch blog separately cites a "Long Context Ruler @1M" result of 95%, which matches the cards' 94.7 once rounded.

What these numbers prove: NVIDIA is confident enough to publish specific, falsifiable figures on standard benchmarks. What they do not prove: that you will see the same throughput, cost, or accuracy on your agent's actual traffic. Vendor-run benchmarks use vendor-chosen hardware, batching, and prompts. The "in its class" comparison is also NVIDIA's framing — the company does not provide a head-to-head against closed frontier models with a shared methodology, so do not read these as a ranking against the best proprietary systems. Until there is independent evaluation, the right posture is to trust the specs (parameters, context, license, architecture) and to verify the performance claims against your own workload.
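One way to verify against your own workload is a small timing harness like the Python sketch below. It assumes you write a `generate` adapter around whatever endpoint you use; that adapter, `measure`, and `fake_generate` are hypothetical names for illustration, and the stand-in model call only sleeps so the example runs offline.

```python
import statistics
import time
from typing import Callable, Iterable


def measure(generate: Callable[[str], int], prompts: Iterable[str]) -> dict:
    """Time each call and report latency and output-token throughput.

    `generate` is a hypothetical adapter around your own endpoint: it sends
    the prompt and returns the number of output tokens produced.
    """
    latencies, tokens = [], 0
    for prompt in prompts:
        start = time.perf_counter()
        tokens += generate(prompt)
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    return {
        "requests": len(latencies),
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        "output_tokens_per_s": tokens / total if total > 0 else 0.0,
    }


def fake_generate(prompt: str) -> int:
    """Stand-in for a real model call so the sketch runs offline."""
    time.sleep(0.01)
    return len(prompt.split())


report = measure(fake_generate, ["plan the refactor", "summarize the tool output"])
print(report)
```

This measures sequential, single-stream requests. Vendor throughput figures typically come from batched serving, so a fair comparison needs concurrent load on the same hardware class, using prompts drawn from your real agent traces.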

Deployment reality: NIM, Hugging Face, and data-center GPUs

This is not a model you run on a desktop. The BF16 model card lists a single-node minimum of "8× B200 (≈1.5 TB aggregate HBM — fits BF16 weights plus KV cache with headroom)," and a general requirement of "8x GB200/B200/GB300/B300, 16x H100, 8x H200." The NVFP4 checkpoint lowers the bar but is still data-center class: NVIDIA lists a minimum of "4xGB200, 4xB200, 4x GB300, 4x B300, 8xH100" and a recommended single node of "4× B200."
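The "fits BF16 weights plus KV cache with headroom" claim can be sanity-checked with back-of-envelope arithmetic, sketched below in Python. The parameter count and the 1.5 TB node figure come from the card as quoted above; the 0.5 bytes-per-parameter value for NVFP4 is our simplifying assumption (4-bit values only), so treat that result as a lower bound.

```python
def weight_footprint_tb(params_billion: float, bytes_per_param: float) -> float:
    """Weight-only memory in decimal terabytes (1 TB = 1e12 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e12


TOTAL_PARAMS_B = 550   # total parameters, from the model card
NODE_HBM_TB = 1.5      # approximate aggregate HBM for 8x B200, per the card

# BF16 stores 2 bytes per parameter.
bf16_tb = weight_footprint_tb(TOTAL_PARAMS_B, 2.0)

# Assumption: 4-bit values only, ignoring NVFP4 scale factors and any layers
# kept in higher precision, so the real checkpoint is somewhat larger.
nvfp4_tb = weight_footprint_tb(TOTAL_PARAMS_B, 0.5)

headroom_tb = NODE_HBM_TB - bf16_tb
print(f"BF16 weights:  {bf16_tb:.2f} TB, leaving about {headroom_tb:.2f} TB on the node")
print(f"NVFP4 weights: at least {nvfp4_tb:.3f} TB")
```

The headroom is what remains for KV cache, activations, and runtime overhead, which is why long contexts and high concurrency can still exhaust a node that holds the weights comfortably.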

On the software side, the cards list support across vLLM (v0.22.0), SGLang (v0.5.12.post1), TensorRT-LLM (v1.3.0rc17, Blackwell only), and Docker Model Runner, with NeMo 26.04.01 noted as a runtime for the NVFP4 path. If you are weighing serving stacks for open models, our vLLM vs Ollama vs TGI comparison covers the trade-offs, though the consumer-friendly options there are not in scope for a model of this footprint.

The practical takeaway: for almost every team, Nemotron 3 Ultra is something you consume as a hosted endpoint (NIM on build.nvidia.com) or run on rented multi-GPU Blackwell or Hopper nodes, not something you self-host on a single workstation. Anyone planning a self-managed deployment should budget for at least a 4×-GPU Blackwell node for the NVFP4 checkpoint or an 8×-GPU node for BF16, and read the model card's hardware section as a hard floor, not a suggestion.
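If you track candidate nodes in a script, the card's minimums can be encoded as a simple lookup. The Python sketch below uses only the GPU counts quoted in this article; the `MIN_GPUS` table and `meets_floor` helper are our own illustrative names, and the live model cards may list configurations that are not quoted here.

```python
# Minimum GPU counts as quoted from the model cards in this article.
MIN_GPUS = {
    "BF16": {"GB200": 8, "B200": 8, "GB300": 8, "B300": 8, "H100": 16, "H200": 8},
    "NVFP4": {"GB200": 4, "B200": 4, "GB300": 4, "B300": 4, "H100": 8},
}


def meets_floor(checkpoint: str, gpu: str, count: int) -> bool:
    """True if `count` GPUs of type `gpu` meet the quoted minimum.

    GPU types not quoted for a checkpoint are treated as unsupported here.
    """
    required = MIN_GPUS.get(checkpoint, {}).get(gpu)
    return required is not None and count >= required


for checkpoint, gpu, count in [("NVFP4", "B200", 4), ("BF16", "H100", 8), ("BF16", "B200", 8)]:
    verdict = "meets" if meets_floor(checkpoint, gpu, count) else "is below"
    print(f"{count}x {gpu} {verdict} the quoted floor for {checkpoint}")
```

Meeting the floor says nothing about throughput or usable context length on that node, so it is a first filter before benchmarking, not a sizing answer.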

Toolhalla verdict: who should watch it now

Nemotron 3 Ultra is most interesting if you are building long-running, tool-using agents and want an open-weight option at frontier scale, with a million-token context and a license that NVIDIA describes as permissive. It is least interesting if you need something to run locally or on a budget GPU — the hardware requirements rule that out.

For directory purposes, this is a "watch" candidate: the specs and openness are real and verifiable from the model cards, but the throughput, cost, and accuracy advantages are vendor-reported and need independent confirmation before they belong in a buying decision. It is a distinct release from NVIDIA's earlier Nemotron-Labs diffusion language models and from the broader Nemotron 3 guide — the agent-orchestration angle is what makes this one worth tracking on its own.

FAQ

What is NVIDIA Nemotron 3 Ultra?

It is an open-weight Mixture-of-Experts language model NVIDIA released on June 4, 2026, with 550B total parameters and 55B active. NVIDIA positions it for long-running agents that need frontier reasoning, tool use, and long-context analysis.

Is Nemotron 3 Ultra open weight?

Yes. NVIDIA describes the release as "fully open—including weights, data, and recipes," distributed under the OpenMDW License Agreement v1.1 (OpenMDW-1.1), which the company calls a Linux Foundation permissive license. It is not Apache 2.0; read the OpenMDW-1.1 terms before relying on them.

Can I run Nemotron 3 Ultra locally?

Not in any normal sense. The BF16 card lists a single-node minimum of 8× B200 (about 1.5 TB aggregate HBM), and even the smaller NVFP4 checkpoint needs roughly 4× B200 or 8× H100. This is a data-center-class model, not a desktop or single-GPU one.

What does 550B total and 55B active mean?

The model has 550B parameters in total, but its Mixture-of-Experts design only activates about 55B of them for any given token. That routing is what lets a frontier-scale model run more cheaply than a dense 550B model would.

Is Nemotron 3 Ultra useful for coding agents?

Potentially. NVIDIA reports a SWE-Bench Verified score of 71.9 and a LiveCodeBench (v6) score of 89.0, and cites "30% cost savings to complete SWE Bench verified." Those are NVIDIA's numbers; validate them on your own coding-agent tasks before committing.

What should builders verify before adopting it?

Confirm the OpenMDW-1.1 license fits your use, the real cost and latency on your workload (not the vendor benchmark), the GPU footprint you can actually provision, and whether your serving stack (vLLM, SGLang, TensorRT-LLM, or NIM) supports the checkpoint you intend to run.

*Disclosure: This article references NVIDIA developer and Hugging Face resources. Toolhalla has not independently tested Nemotron 3 Ultra; all performance figures are NVIDIA-reported.*

