Vercel AI Gateway Provider Sorting: Cost, Latency, and Throughput
Vercel AI Gateway now lets developers sort providers behind a model by cost, time to first token, or throughput. Here is what the new sort option changes, and what it still does not prove.
Vercel says its AI Gateway can now sort providers behind a model by cost, time to first token, or throughput. The change, published in the Vercel changelog on May 15, 2026 by Walter Korman and Jerilyn Zheng, gives developers an explicit knob to choose which provider behind a given model should be tried first.
For teams using the AI SDK against multi-provider models, this turns an implicit default into a deliberate routing policy. That is useful, but it does not change what a gateway can and cannot prove.
Source: Vercel changelog: Sort providers by cost, latency, or throughput on AI Gateway.
What Vercel changed on May 15, 2026
Vercel's default provider order blends provider reliability, quality of model output, cost, and response speed. According to the changelog, the new sort option on providerOptions.gateway lets developers override that blended default with one of three explicit criteria.
The values are:
- cost: rank providers by listed input price per million tokens, lowest first.
- ttft: rank by median time to first token in milliseconds, lowest first.
- tps: rank by median throughput in tokens per second, highest first.
Vercel says ranking is computed at request time, so newly added providers, price changes, and shifts in observed latency or throughput flow through automatically without code changes. Providers are tried in sort order, and fallback to the next provider only happens when the higher-ranked one is unavailable.
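The ordering and fallback behavior described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not Vercel's implementation: the ProviderStats shape, the provider names, and every number below are hypothetical, and an available flag stands in for whatever the gateway actually uses to decide that a provider is unavailable.

```typescript
// Hypothetical provider stats; none of these names or numbers come from Vercel.
type SortKey = "cost" | "ttft" | "tps";

interface ProviderStats {
  name: string;
  inputPricePerMTok: number; // listed input price, USD per million tokens
  medianTtftMs: number; // median time to first token, milliseconds
  medianTps: number; // median throughput, tokens per second
  available: boolean; // stand-in for the gateway's availability check
}

// Return a new array ordered the way the changelog describes each sort value.
function rankProviders(providers: ProviderStats[], sort: SortKey): ProviderStats[] {
  const ranked = [...providers];
  switch (sort) {
    case "cost":
      ranked.sort((a, b) => a.inputPricePerMTok - b.inputPricePerMTok); // lowest first
      break;
    case "ttft":
      ranked.sort((a, b) => a.medianTtftMs - b.medianTtftMs); // lowest first
      break;
    case "tps":
      ranked.sort((a, b) => b.medianTps - a.medianTps); // highest first
      break;
  }
  return ranked;
}

// Walk the ranked list and fall back only when a higher-ranked provider is unavailable.
function pickProvider(providers: ProviderStats[], sort: SortKey): ProviderStats | undefined {
  return rankProviders(providers, sort).find((p) => p.available);
}

const sampleProviders: ProviderStats[] = [
  { name: "provider-a", inputPricePerMTok: 0.1, medianTtftMs: 420, medianTps: 180, available: false },
  { name: "provider-b", inputPricePerMTok: 0.15, medianTtftMs: 250, medianTps: 140, available: true },
  { name: "provider-c", inputPricePerMTok: 0.25, medianTtftMs: 310, medianTps: 260, available: true },
];

// provider-a has the lowest input price but is marked unavailable,
// so a cost-sorted request falls through to provider-b.
const costPick = pickProvider(sampleProviders, "cost");
console.log(costPick?.name);
```

In this sketch a slow but available provider is never skipped, which mirrors the changelog's description of fallback as an availability mechanism rather than a per-request latency guard.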
The three sort modes
Cost
sort: 'cost' orders providers by listed input price per million tokens, cheapest first. Vercel's changelog uses GPT OSS 120B as an example: AI Gateway exposes more than five providers for that model, and they do not all charge the same per-token rate, which is the situation where price-first routing matters most.
This mode is positioned for high-volume, cost-sensitive work where input price is the dominant variable. Vercel's note describes it as ranking by listed input price, so output price, request volume, retries, and caching are not part of the sort itself.
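A short worked example shows why ranking on input price alone can diverge from the cheapest job. The two price sheets below are invented for illustration and do not describe any real provider; the only assumption is that a job's cost is input tokens times input price plus output tokens times output price.

```typescript
// Hypothetical price sheets, in USD per million tokens. Not real provider prices.
interface ListedPrice {
  name: string;
  inputPerMTok: number;
  outputPerMTok: number;
}

// Cost of one job under simple per-token billing (no caching, no retries).
function jobCostUsd(price: ListedPrice, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * price.inputPerMTok +
    (outputTokens / 1_000_000) * price.outputPerMTok
  );
}

const cheapInput: ListedPrice = { name: "cheap-input", inputPerMTok: 0.1, outputPerMTok: 0.8 };
const cheapOutput: ListedPrice = { name: "cheap-output", inputPerMTok: 0.15, outputPerMTok: 0.5 };

// Input-heavy job: 10M input tokens, 100k output tokens.
const inputHeavy = {
  cheapInput: jobCostUsd(cheapInput, 10_000_000, 100_000),
  cheapOutput: jobCostUsd(cheapOutput, 10_000_000, 100_000),
};

// Output-heavy job: 1M input tokens, 1M output tokens.
const outputHeavy = {
  cheapInput: jobCostUsd(cheapInput, 1_000_000, 1_000_000),
  cheapOutput: jobCostUsd(cheapOutput, 1_000_000, 1_000_000),
};

console.log({ inputHeavy, outputHeavy });
```

Under these made-up prices, a cost sort would rank cheap-input first in both cases. That matches the cheaper bill for the input-heavy job (about $1.08 versus $1.55) but not for the output-heavy one (about $0.90 versus $0.65).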
TTFT
sort: 'ttft' orders providers by median time to first token in milliseconds, lowest latency first. The intent is to send latency-sensitive traffic to the provider that has historically responded quickest.
This mode is positioned for interactive or perceived-latency workloads: chat, autocomplete, and real-time agent steps where the person on the other end is waiting for the first chunk of output. Because ranking is computed at request time, the order tracks observed median latency over time rather than a static benchmark.
TPS
sort: 'tps' orders providers by median throughput in tokens per second, highest first. This mode is positioned for long-output generation, where the time to produce the full response matters more than the time to the first token.
For workloads like long summaries, batch reports, or large structured outputs, the provider with the lowest TTFT may not be the provider that finishes generation first. TPS-sorted routing targets those workloads, where finishing the full response sooner matters more than starting it sooner.
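The gap between starting first and finishing first can be made concrete with a rough model: total time is approximately TTFT plus output tokens divided by throughput. The sketch below uses that simplification with two invented speed profiles; it ignores network overhead, queueing, and variance around the medians.

```typescript
// Hypothetical speed profiles. Not measurements of any real provider.
interface SpeedProfile {
  name: string;
  ttftMs: number; // time to first token, milliseconds
  tokensPerSecond: number; // sustained output throughput
}

// Rough estimate: time to first token plus time to stream the remaining output.
function estimatedTotalMs(profile: SpeedProfile, outputTokens: number): number {
  return profile.ttftMs + (outputTokens / profile.tokensPerSecond) * 1000;
}

const quickStart: SpeedProfile = { name: "quick-start", ttftMs: 200, tokensPerSecond: 100 };
const fastStream: SpeedProfile = { name: "fast-stream", ttftMs: 500, tokensPerSecond: 250 };

// Short reply (20 tokens): the low-TTFT provider finishes first.
const shortReply = {
  quickStart: estimatedTotalMs(quickStart, 20), // 200 + 200 = 400 ms
  fastStream: estimatedTotalMs(fastStream, 20), // 500 + 80 = 580 ms
};

// Long output (4000 tokens): the high-throughput provider finishes first.
const longOutput = {
  quickStart: estimatedTotalMs(quickStart, 4000), // 200 + 40000 = 40200 ms
  fastStream: estimatedTotalMs(fastStream, 4000), // 500 + 16000 = 16500 ms
};

console.log({ shortReply, longOutput });
```

With these illustrative numbers the two profiles break even at 50 output tokens; below that the ttft ordering wins on total time, and above it the tps ordering does.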
When each routing mode makes sense
The three modes map to three different shapes of workload:
- Use cost when input volume is high, latency is acceptable, and a lower per-million-token rate dominates the bill. Common cases include large-batch classification, embedding-adjacent rewrites, and back-office summarization.
- Use ttft when a person is waiting for the first chunk. Conversational UIs, agent loops with interactive feedback, and IDE-style assistants tend to live or die by first-token latency.
- Use tps when total wall-clock matters more than the first token. Long-form generation, code rewrites, and report rendering benefit when the chosen provider sustains a higher tokens-per-second rate.
Most production stacks will end up using more than one. A common pattern is ttft for interactive surfaces, cost for background jobs, and tps for long-running generations, all hitting the same model through the same gateway, with the sort chosen per request.
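One way to keep that per-request choice deliberate is to map workload types to sort values in a single place. The helper below is a sketch: the Workload names and the gatewayOptionsFor function are hypothetical, and the returned object only follows the providerOptions.gateway location and the three sort values named in the changelog.

```typescript
// The three sort values named in Vercel's changelog.
type GatewaySort = "cost" | "ttft" | "tps";

// Hypothetical workload labels for this sketch; pick names that fit your own stack.
type Workload = "interactive" | "background" | "long-generation";

const sortForWorkload: Record<Workload, GatewaySort> = {
  interactive: "ttft", // a person is waiting for the first chunk
  background: "cost", // input volume dominates the bill
  "long-generation": "tps", // total wall-clock matters most
};

// Build the value to pass as providerOptions on a request.
function gatewayOptionsFor(workload: Workload): { gateway: { sort: GatewaySort } } {
  return { gateway: { sort: sortForWorkload[workload] } };
}

console.log(JSON.stringify(gatewayOptionsFor("interactive")));

// Usage sketch (not executed here). It assumes the AI SDK's generateText and a
// gateway model slug such as "openai/gpt-oss-120b"; check both against Vercel's docs.
//
//   const result = await generateText({
//     model: "openai/gpt-oss-120b",
//     prompt: "Summarize this ticket.",
//     providerOptions: gatewayOptionsFor("background"),
//   });
```

Centralizing the mapping means a workload can be moved from one sort to another without touching call sites.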
For a wider view of where this kind of routing fits alongside other gateway products, see our comparison of OpenRouter, LiteLLM, and Portkey in 2026.
What provider sorting does not prove
Sorting is a routing preference, not a guarantee. A few things worth being explicit about:
- sort: 'cost' does not guarantee the lowest total bill. Vercel's note describes it as ranking by listed input price per million tokens. Output length, prompt caching behavior, retry rates on failures, and any output-token premium can change the actual invoice. The cheapest input price is not always the cheapest job.
- sort: 'ttft' and sort: 'tps' describe routing inputs, not delivered SLAs. Median latency or throughput is what the gateway uses to rank, not what any specific request is contractually guaranteed to receive. A tail-latency event still happens at the tail.
- Sorting does not normalize quality. Vercel describes the default as also weighing model output quality. Once you override that default with a single dimension, you are explicitly accepting whatever quality variance exists between providers for that model. If a provider runs a quantized or otherwise different deployment, the sort does not surface that.
- Fallback is conditional. Vercel says the next provider is used only when the higher-ranked one is unavailable. That covers outages and errors. It does not automatically swap providers because the current one is slow on a particular request.
- Vercel did not publish hands-on benchmarks in the changelog. Anything not in the source — including specific price comparisons, latency numbers, or provider rankings that Toolhalla has not measured — is not something we are asserting.
Directory implications for Toolhalla and AI infra buyers
For Toolhalla's directory, the change reinforces a category trend rather than creating a new product class. AI Gateway already belonged in the LLM gateway and provider-routing bucket alongside OpenRouter, LiteLLM, and Portkey. The sort option is a feature update, not a new entry.
What it does change is what buyers should ask when evaluating an LLM gateway:
- Does it expose ranking criteria explicitly, or only as a blended default?
- Are rankings recomputed at request time, or pinned at deploy time?
- How is fallback triggered — outage, error rate, latency, or something else?
- Are listed prices the actual billed prices for your account, or list-price indicators?
A gateway that lets you choose between cost, ttft, and tps per request is one shape of answer. A gateway that pins routing in policy files or an admin UI is another. Neither is wrong, but they imply different operational habits.
For AI infrastructure buyers, the practical implication is to map each workload to a single dimension before turning on per-request sorting. If you cannot say whether a workload is cost-, latency-, or throughput-bound, picking a sort value is guesswork.
FAQ
Where is sort configured?
sort is set on providerOptions.gateway in the AI SDK request, per Vercel's changelog.
What values does sort accept?
cost, ttft, and tps. Vercel's changelog defines cost as listed input price per million tokens (lowest first), ttft as median time to first token in milliseconds (lowest first), and tps as median tokens per second (highest first).
Does sort: 'cost' guarantee the cheapest total bill?
No. The sort ranks by listed input price per million tokens. Output length, retries, caching, and any output-token premium can still change the final cost.
How does fallback work with sorted routing?
Vercel says providers are tried in sort order, and fallback to the next provider only happens when the higher-ranked one is unavailable.
Are the rankings static?
No. Vercel says ranking is computed at request time, so newly added providers, price changes, and shifts in observed latency or throughput flow through automatically without code changes.
Has Toolhalla tested this hands-on?
No. This article is a sourced summary of Vercel's May 15, 2026 changelog plus an analysis of what sort does and does not prove. We have not run our own measurements.
Sources
- Vercel changelog, "Sort providers by cost, latency, or throughput on AI Gateway" (May 15, 2026, by Walter Korman and Jerilyn Zheng): https://vercel.com/changelog/sort-providers-by-cost-latency-or-throughput-on-ai-gateway
- Vercel AI Gateway product page: https://vercel.com/ai-gateway