llm-d Joins CNCF Sandbox: Kubernetes-Native LLM Inference Is Here

IBM, Red Hat, and Google's llm-d has been accepted into the CNCF Sandbox — bringing production-grade, Kubernetes-native LLM inference to the cloud-native stack. Here's what it means for teams running vLLM and KServe at scale.

March 31, 2026·10 min read·1,648 words

In short: llm-d is a Kubernetes-native LLM inference framework from IBM, Red Hat, and Google, accepted into the CNCF Sandbox in late March 2026. It adds inference-aware (prefix-cache) routing, disaggregated prefill/decode, and KV cache offloading atop vLLM and KServe, rather than replacing them.

Most of the AI infrastructure conversation in 2026 has been about models: which one benchmarks higher, which one costs less per token, which one just dropped. But there's a quieter story that matters more for teams actually running LLM workloads in production: how does inference infrastructure scale?

The answer, increasingly, is Kubernetes. And the project that just got a major institutional endorsement for that path is llm-d, a joint effort from IBM, Red Hat, and Google that was accepted into the CNCF Sandbox in late March 2026, timed to KubeCon Europe. The hardware side of inference is shifting too (see Arm's Custom AGI CPU: 136 Cores, 3nm, and the End of Nvidia-Only Inference), but llm-d targets the orchestration layer that sits above whatever accelerators you run.

If you're running vLLM, KServe, or any production-grade inference stack, this is worth understanding before it becomes table stakes.

What Is llm-d?

llm-d (no official pronunciation has been confirmed; "LLM-d" is the obvious reading) is a Kubernetes-native framework for large language model inference. The project is designed to close the gap between cloud-native infrastructure patterns and the specific demands of LLM serving, a gap that has been awkward and expensive for most production teams.

At its core, llm-d is built around one insight: standard Kubernetes load balancing is wrong for LLM inference.

Traditional HTTP routing doesn't know or care about KV cache state. It treats every request as independent. But LLM inference has expensive stateful structure: the KV cache built up during a session represents real compute, and that compute is wasted whenever a request is blindly routed to a pod that doesn't have the relevant cache warm.

llm-d solves this with inference-aware routing.

The Technical Architecture

Inference-Aware Routing with Prefix Cache

The flagship feature: llm-d's scheduler is aware of which pods have which KV cache prefixes warm. When a new request arrives, it routes to the pod that already has the most relevant cache state, rather than distributing requests randomly or by simple round-robin. Reusing warm cache cuts redundant compute and improves latency, which is why it matters for serving large language models at scale.
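As a rough illustration of the routing idea above, here is a minimal sketch that picks the pod holding the longest warm prefix for an incoming request and breaks ties by load. The `Pod` class, its fields, and `pick_pod` are hypothetical names invented for this example; they are not llm-d's API, and a real scheduler would weigh more signals than these two.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical view of one inference pod as seen by a router."""
    name: str
    cached_prefixes: list = field(default_factory=list)  # token tuples held warm
    in_flight: int = 0  # requests currently being served


def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_cached_len(pod, tokens):
    """Longest warm prefix this pod holds for the incoming request."""
    return max((shared_prefix_len(p, tokens) for p in pod.cached_prefixes), default=0)


def pick_pod(pods, tokens):
    """Prefer the pod with the longest warm prefix; break ties by lowest load."""
    return max(pods, key=lambda p: (best_cached_len(p, tokens), -p.in_flight))


system_prompt = tuple(range(100))  # stand-in for a tokenized system prompt
pods = [
    Pod("pod-a", cached_prefixes=[system_prompt], in_flight=3),
    Pod("pod-b", cached_prefixes=[], in_flight=0),
]
request = system_prompt + (7, 8, 9)
chosen = pick_pod(pods, request)
print(chosen.name)  # pod-a: busier, but it already holds the system prompt
```

Round-robin would have sent this request to the idle pod-b and recomputed the whole prompt there. A request with no warm prefix anywhere falls back to the least-loaded pod.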

This routing mechanism is particularly useful where multiple agents need to be orchestrated efficiently, as discussed in Multi-Agent Orchestration: A Practical Guide for 2026. Directing each request to the pod best placed to serve it helps keep performance and resource utilization steady.

For workloads with shared system prompts (which covers most production applications), this can dramatically reduce the effective compute cost per request. A 4,096-token system prompt that is already cached costs almost nothing to prefill again. Without cache-aware routing, it gets recomputed on every cold hit.
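To put a number on the shared-system-prompt case, the back-of-the-envelope sketch below counts how many tokens have to be prefilled with and without cache hits. The 200-token user message, the 1,000 requests, and the 90% hit rate are assumptions chosen for illustration, and the model ignores everything except prefill token counts.

```python
def prefill_tokens(system_len, user_len, requests, cache_hit_rate):
    """Tokens that must be prefilled across `requests` at a given prefix-cache hit rate.

    On a hit only the user portion is computed; on a miss the full prompt is.
    """
    hits = round(requests * cache_hit_rate)
    misses = requests - hits
    return hits * user_len + misses * (system_len + user_len)


SYSTEM_LEN = 4096  # system prompt length from the text
USER_LEN = 200     # assumed average user message length
REQUESTS = 1000    # assumed request count

cold = prefill_tokens(SYSTEM_LEN, USER_LEN, REQUESTS, cache_hit_rate=0.0)
warm = prefill_tokens(SYSTEM_LEN, USER_LEN, REQUESTS, cache_hit_rate=0.9)

print(cold)  # 4296000
print(warm)  # 609600
print(f"{1 - warm / cold:.1%} fewer prefill tokens")
```

Under these assumptions, nine cache hits in ten remove roughly 86% of the prefill work, which is why the hit rate a router can sustain matters more than any single cached request.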

Measured impact: in benchmarks, llm-d has demonstrated throughput of approximately 120,000 tokens per second across distributed configurations. That's not a toy number.

Disaggregated Prefill and Decode

llm-d supports separating the prefill phase (processing the full prompt) from the decode phase (generating tokens one-by-one). These two operations have very different compute profiles:

Prefill is compute-bound: it processes all input tokens in parallel, so latency is dominated by compute throughput.
Decode is memory-bandwidth-bound: it generates one token at a time, accessing the full KV cache on each step.

Running both phases on the same GPU means neither is optimized. Disaggregated prefill/decode lets you assign different hardware to each phase — higher GPU utilization, lower cost per token, better tail latency.
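The difference between the two phases can be seen with a simplified roofline-style estimate. The sketch below considers only one dense layer's weight traffic and ignores activations and KV cache reads; the accelerator figures (1,000 TFLOP/s, 3 TB/s) are hypothetical round numbers, not the specification of any real device.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read for one dense layer.

    A d x d matmul over n tokens costs about 2*n*d*d FLOPs and reads d*d
    weights once, so d cancels and the intensity is 2*n / bytes_per_param.
    """
    return 2 * tokens_per_pass / bytes_per_param


# Hypothetical accelerator: 1000 TFLOP/s of compute, 3 TB/s of memory bandwidth.
PEAK_FLOPS = 1000e12
PEAK_BYTES_PER_S = 3e12
ridge = PEAK_FLOPS / PEAK_BYTES_PER_S  # intensity needed to become compute-bound


def bound(tokens_per_pass):
    """Classify a forward pass by comparing its intensity with the ridge point."""
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bandwidth-bound"


print(bound(4096))  # prefill of a 4,096-token prompt -> compute-bound
print(bound(1))     # decode step for a single sequence -> memory-bandwidth-bound
```

Batching many decode requests raises `tokens_per_pass` and moves decode toward the ridge point, which is one reason the two phases benefit from being sized and scheduled separately.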

This is the same approach that Google and others use internally. llm-d brings it to the Kubernetes-native world.

KV Cache Offloading

KV cache offloading allows llm-d to spill cache to CPU memory or even disk when GPU VRAM is the bottleneck. In practice, this means:

GPU memory for hot, recently accessed KV cache entries
CPU memory as a second tier for warm-but-not-active entries
NVMe/disk as a third tier for cold cache (depending on latency requirements)

This hierarchy enables much larger effective context lengths than the raw VRAM of any single GPU would support.
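The tiering above behaves much like a multi-level LRU cache. The toy sketch below demotes the least recently used entry to the next tier when a tier overflows and promotes an entry back to the GPU tier on access. `TieredKVCache`, its tier names, and the capacities are hypothetical and counted in entries rather than bytes; this is not how llm-d or vLLM implement offloading.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier cache: entries evicted from one tier are demoted to the next."""

    def __init__(self, capacities):
        # capacities: mapping of tier name -> max entries, hottest tier first
        self.capacities = dict(capacities)
        self.order = list(self.capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def _insert(self, tier_index, key, value):
        if tier_index >= len(self.order):
            return  # fell off the coldest tier: dropped, to be recomputed later
        name = self.order[tier_index]
        tier = self.tiers[name]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.capacities[name]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._insert(tier_index + 1, old_key, old_value)

    def put(self, key, value):
        for tier in self.tiers.values():
            tier.pop(key, None)  # avoid duplicates across tiers
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier it was found in) and promote it to the hottest tier."""
        for name in self.order:
            if key in self.tiers[name]:
                value = self.tiers[name].pop(key)
                self._insert(0, key, value)
                return value, name
        return None, None

    def where(self, key):
        return next((n for n in self.order if key in self.tiers[n]), None)


cache = TieredKVCache({"gpu": 2, "cpu": 2, "disk": 2})
for session in ["s1", "s2", "s3", "s4", "s5"]:
    cache.put(session, f"kv-blocks-for-{session}")

placement = {s: cache.where(s) for s in ["s1", "s2", "s3", "s4", "s5"]}
print(placement)  # s1 on disk, s2 and s3 on cpu, s4 and s5 on gpu
```

Calling `cache.get("s1")` afterwards finds the entry on disk, moves it back to the GPU tier, and pushes the coldest GPU entry down a level, which is the latency-for-capacity trade the tiers represent.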

Accelerator-Agnostic

llm-d works across NVIDIA GPUs, AMD GPUs, and Google TPUs. The abstraction layer doesn't assume any specific vendor, which matters as teams increasingly mix hardware across on-premise and cloud environments.

How It Fits With vLLM and KServe

llm-d is not a replacement for either vLLM or KServe — it's a routing and orchestration layer that bridges them.

Component	Role
vLLM	Inference engine (model execution, KV cache management)
KServe	Model serving abstraction on Kubernetes
llm-d	Inference-aware routing, disaggregated prefill/decode scheduling, KV cache offloading coordination

The architecture assumes vLLM as the backing inference engine and KServe as the serving framework. llm-d adds the missing scheduler that makes the whole stack cache-aware at the Kubernetes networking layer.

If you're already running vLLM + KServe in production, llm-d is a drop-in routing upgrade — not a re-architecture.

Why CNCF Sandbox Matters

The Cloud Native Computing Foundation (CNCF) Sandbox is the entry point for early-stage cloud-native projects that the CNCF's Technical Oversight Committee has judged to be solving real problems with sound approaches. Graduated CNCF projects such as Prometheus, Envoy, Helm, Argo, and Flux became industry standards, although not all of them entered through the Sandbox tier.

CNCF Sandbox acceptance for llm-d means:

1. Neutral governance: The project moves out of pure vendor control (IBM/Red Hat/Google) into a vendor-neutral foundation with documented governance processes.

2. Security review: CNCF projects are expected to document their security and vulnerability disclosure processes, and an independent security audit is part of the path to graduation.

3. Ecosystem integration pressure: Other CNCF projects will increasingly ensure compatibility with llm-d as a peer.

4. Talent signal: Engineers know CNCF projects tend to persist and grow. Hiring and contribution follow.

The timing — KubeCon Europe 2026 — was deliberate. This is the audience that will adopt, contribute to, and eventually standardize on llm-d if the project succeeds.

Who Should Pay Attention

Teams Running vLLM in Production

If you're serving LLM traffic with vLLM today and you're on Kubernetes, llm-d is probably the next meaningful infrastructure upgrade to evaluate. The cache-aware routing alone can meaningfully reduce cost per token for workloads with shared prefixes.

Platform Teams at Enterprises Using Red Hat OpenShift

IBM and Red Hat's involvement is not accidental. Expect llm-d to become a first-class feature of Red Hat OpenShift AI (RHOAI) in the medium term. If your organization runs OpenShift, this is the project to watch.

Teams Hitting GPU Memory Ceilings

KV cache offloading is genuinely useful for extending effective context without horizontal scaling. If you're regularly hitting VRAM limits on large-context workloads, llm-d's tiered offloading may be worth testing.
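To see why long contexts run into VRAM limits, it helps to estimate the KV cache footprint of a single sequence. The sketch below uses the standard per-token accounting (one key and one value vector per layer); the model shape of 32 layers and 8 KV heads of dimension 128 with 16-bit values is an assumed example, not a specific model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence: a key and a value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed model shape: 32 layers, 8 KV heads of dimension 128, 16-bit values.
per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128)
context = kv_cache_bytes(128_000, layers=32, kv_heads=8, head_dim=128)

print(per_token)  # 131072 bytes, i.e. 128 KiB per token
print(f"{context / 2**30:.1f} GiB for one 128,000-token sequence")
```

At roughly 15.6 GiB for one such sequence under these assumptions, a handful of concurrent long-context sessions can exhaust the memory left over after model weights, which is the situation tiered offloading is meant to relieve.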

Anyone Building Multi-Tenant Inference Platforms

The combination of inference-aware routing and disaggregated prefill/decode maps well onto multi-tenant scenarios where different tenants have different system prompts. Cache partitioning by prefix is exactly the right primitive.
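One way to picture cache partitioning by prefix is with chained block hashes: each fixed-size block of tokens gets a key derived from its contents and the previous block's key, so requests share keys only for the blocks of their common prefix. The sketch below is a simplified illustration in the spirit of block-level prefix caching; `block_keys`, the block size of 16, and the tenant prompts are assumptions for the example, not llm-d's actual scheme.

```python
import hashlib


def block_keys(tokens, block_size=16):
    """Chained hashes over fixed-size token blocks.

    Each key depends on the block's tokens and on the previous key, so two
    prompts share keys exactly for the full blocks of their common prefix.
    """
    keys = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(tuple(block)).encode()).digest()
        keys.append(digest.hex())
        parent = digest
    return keys


def shared_blocks(a, b):
    """Number of leading block keys two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


tenant_a_prompt = list(range(64))          # assumed tokenized system prompt, tenant A
tenant_b_prompt = list(range(1000, 1064))  # assumed tokenized system prompt, tenant B

req1 = block_keys(tenant_a_prompt + [1, 2, 3])
req2 = block_keys(tenant_a_prompt + [9, 9, 9, 9])
req3 = block_keys(tenant_b_prompt + [1, 2, 3])

print(shared_blocks(req1, req2))  # 4: same tenant, all four system-prompt blocks reusable
print(shared_blocks(req1, req3))  # 0: different tenant, nothing shared
```

Because keys are chained, a router that indexes pods by these keys can match the longest shared prefix with simple lookups, and tenants with different system prompts naturally land in disjoint parts of the cache.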

The Competitive Landscape

llm-d isn't the only project in this space, but it's the one with the most institutional weight behind it and now CNCF governance.

Competitors and adjacent projects:

NVIDIA Triton Inference Server: NVIDIA's production-grade inference server, mature but NVIDIA-centric. No native Kubernetes-aware cache routing.
Ray Serve: Distributed inference via Ray, Python-first, flexible but not Kubernetes-native in the same way.
Karpenter + kube-scheduler: Generic Kubernetes scheduling, not inference-aware.
OpenAI's internal infra: Not public.

llm-d's advantage is the combination of Kubernetes-nativeness, vendor neutrality, and backing from organizations that have real production inference workloads (IBM Watson, Google Cloud).

What's Next for llm-d

The project is early. CNCF Sandbox is the beginning of a process, not the end of one. Based on the project roadmap and CNCF Sandbox graduation requirements, expect:

Production hardening: More extensive testing documentation, conformance tests, security audit
OpenShift integration: Red Hat will likely ship llm-d as an OpenShift operator
Broader hardware support: Intel Gaudi, AWS Trainium, and Inferentia are natural targets
Benchmark expansion: More public benchmarks beyond the ~120k tokens/sec headline

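The CNCF maturity path (moving from Sandbox to Incubating to Graduated) typically takes 2-4 years. Helm graduated about two years after joining the CNCF and Argo after about two and a half, and both entered at the Incubating level rather than the Sandbox. llm-d will take whatever time it takes to demonstrate production adoption at the scale CNCF requires.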

The Bottom Line

llm-d represents something genuinely new: inference infrastructure that's actually cloud-native, not just "running on Kubernetes." The three technical primitives — inference-aware routing, disaggregated prefill/decode, and KV cache offloading — solve real problems that current Kubernetes deployments handle poorly or not at all.

The CNCF Sandbox acceptance is an institutional endorsement that the approach is sound and worth the ecosystem's investment. For teams running production LLM inference, this is the sleeper infrastructure story of Q1 2026.

Read the CNCF announcement: Welcome llm-d to the CNCF Sandbox

FAQ

What is llm-d?

llm-d is a Kubernetes-native LLM inference framework developed by IBM, Red Hat, and Google. It adds inference-aware routing, disaggregated prefill/decode scheduling, and KV cache offloading to Kubernetes-based LLM serving stacks — primarily designed to work alongside vLLM and KServe.

What does CNCF Sandbox acceptance mean?

CNCF Sandbox is the entry tier for early-stage cloud-native projects. It signals vendor-neutral governance, documented security processes, and ecosystem integration pressure. Notable CNCF projects that went on to graduate include Prometheus, Helm, Argo, and Flux.

How does llm-d improve LLM inference performance?

By routing requests to the Kubernetes pod that already has the relevant KV cache prefix warm (cache-aware routing), separating compute-heavy prefill from memory-bandwidth-heavy decode (disaggregated prefill/decode), and offloading KV cache to CPU/disk when GPU VRAM is exhausted.

Does llm-d replace vLLM or KServe?

No. llm-d acts as a scheduling and routing layer that bridges KServe and vLLM. It doesn't replace either — it makes the combination more efficient at scale.

What hardware does llm-d support?

llm-d is accelerator-agnostic, with support for NVIDIA GPUs, AMD GPUs, and Google TPUs.
