
Mistral Small 4: A 119B MoE That Runs Like a 6B Model


March 16, 2026·9 min read·1,854 words

Mistral released Small 4 on March 16, 2026. It has 119 billion parameters but activates only 6 billion per token during inference. It ships under Apache 2.0, supports a 256K context window, handles both text and image inputs, and includes configurable reasoning depth you can toggle per request.

On LiveCodeBench, it outperforms GPT-OSS 120B while generating 20% fewer tokens. On logical reasoning benchmarks, it matches Qwen models while using 3.5x less output. This is what Mixture of Experts was supposed to deliver — and Small 4 actually does.

How Mixture of Experts Works Here

Most language models are "dense" — every parameter participates in every token. A 70B dense model uses all 70 billion parameters for each word it generates. That's why big models need big GPUs.

Mixture of Experts (MoE) takes a different approach. Small 4 has 128 specialized expert networks, but only routes each token through 4 of them. The result: 119B total parameters for knowledge capacity, but only ~6B active parameters per token for compute cost.

In practical terms, running Small 4 uses roughly the same GPU bandwidth and arithmetic as a 6-8B dense model. The model is smart like a 119B model but fast like a 6B one.
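
To make the routing concrete, here is a minimal NumPy sketch of top-k expert selection. The sizes are toy values (8 experts, top 2 per token) rather than Small 4's 128 experts with 4 active, and the gating math is illustrative rather than Mistral's actual implementation:


import numpy as np

# Toy mixture-of-experts layer: score all experts, run only the top-k.
NUM_EXPERTS, TOP_K, HIDDEN = 8, 2, 16

rng = np.random.default_rng(0)
router = rng.normal(size=(HIDDEN, NUM_EXPERTS))                    # gating network
experts = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(NUM_EXPERTS)]

def moe_forward(token):
    logits = token @ router                                        # score every expert
    top = np.argsort(logits)[-TOP_K:]                              # keep only the best k
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()        # renormalize their weights
    # Only the selected experts execute, so compute scales with TOP_K,
    # while total parameter count scales with NUM_EXPERTS.
    return sum(g * (token @ experts[i]) for g, i in zip(gates, top))

print(moe_forward(rng.normal(size=HIDDEN)).shape)                  # (16,)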

For a deeper explanation of how quantization and model optimization work, see our quantization guide.

Benchmarks

Numbers from Mistral's published evaluations, cross-referenced with third-party testing:

Coding (LiveCodeBench)

Model Score Output Length
Mistral Small 4 (reasoning=high) 55.2 ~2.1K tokens
GPT-OSS 120B 53.8 ~2.6K tokens
Qwen 2.5 Coder 32B 51.4 ~3.2K tokens

Small 4 generates shorter, more precise code. Fewer tokens mean faster completions and lower cost when using the API.

Logical Reasoning (AA LCR)

Model Accuracy Output Length
Mistral Small 4 0.72 1.6K chars
Qwen 3 32B 0.71 5.8K chars
Qwen 2.5 72B 0.73 6.1K chars

Small 4 reaches the same accuracy as models with 4-10x more active parameters, while producing dramatically less verbose output. This matters for agent workflows where every token adds latency and cost.

General Performance

Capability Mistral Small 4 Mistral Small 3 Improvement
Latency Baseline +40% slower 40% faster
Throughput 3x 1x 3x more requests/sec
Context window 256K 128K 2x longer
Modalities Text + Image Text only Added vision

For full leaderboard context on how this stacks up against other open models, check our open-source LLM leaderboard.

Configurable Reasoning

Small 4 introduces a reasoning_effort parameter that changes how the model thinks — not just what it says, but how much compute it spends per response:

Setting Behavior Best For
none Fast chat mode, similar to Small 3.2 Autocomplete, simple Q&A, classification
low Light reasoning, concise answers Summarization, translation, extraction
medium Balanced reasoning General assistant tasks, writing
high Deep step-by-step reasoning Math, coding, complex analysis

This replaces the old pattern of deploying separate models for different tasks. One model, one deployment, adjustable per request:


from mistralai import Mistral

client = Mistral(api_key="your-key")

# Fast response for simple classification
fast = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "Is this email spam? ..."}],
    reasoning_effort="none"
)

# Deep reasoning for code review
deep = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "Review this function for bugs: ..."}],
    reasoning_effort="high"
)

How to Run Mistral Small 4 Locally

The hardware requirements depend heavily on quantization level. Here's the realistic picture:

Full Precision (BF16)

Setup Hardware
Minimum 4x NVIDIA H100 80GB
Recommended 2x NVIDIA H200 141GB

Full precision is datacenter territory. Not for local use.

Quantized (FP4/Q4)

Setup Hardware Context
Minimum 1x RTX 4090 (24GB) ~8K context
Comfortable 1x RTX 5090 (32GB) ~32K context
Recommended 2x RTX 4090 or 1x A100 80GB Full 256K context

At Q4 quantization, the MoE architecture means you only need bandwidth for ~6B active parameters per forward pass, but you still need to fit all 119B parameters in memory. The model's weight footprint at Q4 is approximately 60 GB, which is why multi-GPU setups or high-VRAM cards are necessary.
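
As a sanity check on that figure, here is the back-of-envelope arithmetic, assuming a flat 4 bits per weight and ignoring quantization metadata, KV cache, and activations, all of which add several more GB:


# Rough Q4 weight footprint for a 119B-parameter checkpoint.
total_params = 119e9
bits_per_weight = 4                       # Q4; per-group scales/zero-points add a bit more
weight_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~60 GB, before KV cache and activations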

For GPU recommendations, see our best GPUs for local AI guide.


With vLLM

vLLM serves the MoE natively. Install it and launch the model with tensor parallelism across two GPUs (the version constraint is quoted so the shell doesn't treat >= as a redirect):


pip install "vllm>=0.18.0"

vllm serve mistralai/Mistral-Small-4-119B-2603 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --quantization fp8
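
Once the server is running, vLLM exposes an OpenAI-compatible endpoint (default port 8000), so any OpenAI client works against it. A minimal sketch, assuming the model name above and the default port:


from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",  # same name passed to vllm serve
    messages=[{"role": "user", "content": "Write a one-line docstring for a binary search."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)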

With Ollama

Ollama supports Mistral Small 4 through community-provided quantized GGUF files:


ollama pull mistral-small:latest
ollama run mistral-small:latest

Note: The default mistral-small tag in Ollama may point to the older 24B dense version, not the new 119B MoE. Check the model card to confirm you're pulling the right version. For more on choosing inference servers, see our vLLM vs Ollama vs TGI comparison.
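
Ollama also exposes a local REST API on port 11434, which is handy for scripting once the model is pulled. A minimal sketch using requests, with the same tag as above (adjust it if you pulled a different quantization):


import requests

# Non-streaming chat request against Ollama's local API.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral-small:latest",
        "messages": [{"role": "user", "content": "Classify this email as spam or not: ..."}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])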

Who Should Use Mistral Small 4

API developers who want one endpoint for everything. The configurable reasoning means you deploy once and adjust per request, rather than managing separate models for chat, coding, and analysis.

Open-source teams building products. Apache 2.0 means no license restrictions — you can fine-tune, distill, and deploy commercially with zero royalties.

Cost-conscious builders replacing GPT-4 class API calls. Mistral's API pricing for Small 4 is significantly lower than GPT-5.4, with competitive quality for most tasks.

Local AI enthusiasts with multi-GPU setups. If you have 2x RTX 4090s or similar, this is one of the most capable models you can self-host. It slots well alongside models we've covered in our Llama vs Mistral vs Phi comparison.

Who Should Skip It

Single-GPU users with 24GB or less. At Q4 quantization, the 119B MoE barely fits in 24GB and context length will be severely limited. You'll get better results from a well-quantized 32B dense model. Check our recommendations for best LLMs for 24GB GPUs.

Teams needing only text generation. If you don't need vision, coding, or reasoning modes, a smaller focused model will be faster and cheaper.

Apple Silicon users. MoE models don't run efficiently on unified memory architectures yet. For Mac-specific recommendations, see our Apple Silicon LLM guide.

The Bigger Picture

Mistral Small 4 represents where open-source AI is headed: MoE architectures that deliver frontier-class performance at a fraction of the compute cost. A year ago, matching GPT-4 required a 70B dense model and serious hardware. Now, a model with 6B active parameters does it under an Apache 2.0 license.

The gap between proprietary and open models continues to narrow. For most production tasks — coding, analysis, multilingual support, document understanding — Small 4 is good enough. That's not faint praise. "Good enough with full control" beats "slightly better behind an API" for a growing number of teams.


FAQ

Is Mistral Small 4 really open source?

Yes. It's released under Apache 2.0, which is one of the most permissive licenses available. You can use it commercially, modify it, fine-tune it, and distribute it without restrictions.

How does Mistral Small 4 compare to Llama 3.3 70B?

Small 4 generally outperforms Llama 3.3 70B on coding and reasoning benchmarks while using less compute per token. Llama 3.3 70B is easier to run on a single GPU since it's a dense model with simpler memory requirements. If you have the VRAM for either, Small 4 is the more capable choice.

Can I fine-tune Mistral Small 4?

Yes, the Apache 2.0 license permits fine-tuning. However, fine-tuning a 119B MoE model requires significant resources. Most teams use LoRA adapters to reduce the compute needed. Mistral provides fine-tuning documentation on their platform.

What context window does Mistral Small 4 support?

256K tokens, which is enough for entire codebases or book-length documents. However, running at full context requires substantially more VRAM than shorter context. At Q4 on a 24GB GPU, expect usable context around 4-8K tokens.


Practical Examples

Example 1: Code Generation

Suppose you need a Python function that sorts a list of integers. Mistral Small 4 returns a concise, direct solution:

Prompt: "Write a Python function to sort a list of integers."

Output:


def sort_integers(lst):
    return sorted(lst)

In contrast, GPT-OSS 120B might generate a more verbose solution, which could include unnecessary comments or additional functionality.

Example 2: Logical Reasoning

Consider a scenario where you need to solve a complex logical reasoning problem, such as determining the validity of a syllogism. Mistral Small 4 can provide a precise and accurate response:

Prompt: "All men are mortal. Socrates is a man. Therefore, Socrates is mortal. Is this argument valid?"

Output: "Yes, the argument is valid. It follows the form of a classic syllogism: All A are B. C is A. Therefore, C is B."

Inference Servers: vLLM vs Ollama vs TGI

When deploying large language models like Mistral Small 4, the choice of inference server can significantly impact performance and cost. Here’s a detailed comparison of vLLM, Ollama, and TGI.

vLLM

Key Features:

  • Throughput: PagedAttention memory management and continuous batching let a single GPU serve far more concurrent requests than naive batching.
  • Scalability: Tensor parallelism spreads one model across multiple GPUs, which is how a 119B MoE like Small 4 fits at all.
  • Compatibility: It exposes an OpenAI-compatible API and supports most open-weight families, including Mistral, Llama, and Qwen.

Use Case: Ideal for large-scale deployments where high throughput and low latency are critical, such as production APIs and cloud-based AI services.

Hardware Requirements:

  • GPUs: Anything from a single RTX 4090 up to multi-GPU A100/H100 nodes
  • VRAM: Set by the model, not the server; for Small 4 at Q4, plan on roughly 60 GB or more across your GPUs
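
Besides the OpenAI-compatible server shown earlier, vLLM also has an offline Python API for batch jobs, which is where continuous batching pays off most. A sketch, assuming the same checkpoint name and a two-GPU machine:


from vllm import LLM, SamplingParams

# Offline batch inference: vLLM schedules the whole prompt list at once.
llm = LLM(
    model="mistralai/Mistral-Small-4-119B-2603",  # assumed checkpoint name from above
    tensor_parallel_size=2,
    max_model_len=32768,
)
params = SamplingParams(temperature=0.2, max_tokens=128)

prompts = [f"Write a one-sentence summary of document {i}." for i in range(32)]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)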

Ollama

Key Features:

  • Ease of Use: A simple CLI (ollama pull, ollama run) and a local REST API get a model running in minutes.
  • Quantization: It runs GGUF-quantized models through llama.cpp and can offload layers to the CPU when VRAM runs out.
  • Portability: Runs on Linux, macOS (including Apple Silicon), and Windows with essentially no configuration.

Use Case: Suitable for local development, prototyping, and single-user or small-team workloads where simplicity matters more than peak throughput.

Hardware Requirements:

  • GPUs: Optional; an NVIDIA RTX 3090/4090 or Apple Silicon makes responses usable, but CPU-only works for small models
  • VRAM: Whatever the chosen quantization needs; 24GB covers most 30B-class models at Q4

TGI (Text Generation Inference)

Key Features:

  • Performance: Hugging Face's production server supports continuous batching, token streaming, and tensor parallelism.
  • Quantization: Built-in GPTQ, AWQ, and bitsandbytes support trades quality for VRAM without leaving the server.
  • Operations: Ships as a Docker image with Prometheus metrics and OpenTelemetry tracing, so it slots into existing monitoring.

Use Case: Best for teams already on the Hugging Face stack who want a supported, containerized server with production observability.

Hardware Requirements:

  • GPUs: NVIDIA A100/H100 class for models of this size; smaller cards are fine for smaller models
  • VRAM: Determined by the model and quantization level, as with vLLM
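
For a quick smoke test, TGI's generate endpoint takes a plain prompt and returns the completion. A minimal sketch, assuming the container's port 80 is published on localhost:8080 (adjust to your mapping):


import requests

# Single-shot request against TGI's /generate endpoint.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain continuous batching in one sentence.",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
print(resp.json()["generated_text"])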

Quick Summary

  • Mistral Small 4 offers a unique blend of efficiency and performance, using a Mixture of Experts approach to reduce computational cost while maintaining high accuracy.
  • vLLM is ideal for large-scale deployments, offering high scalability and efficiency.
  • Ollama is the simplest way to run models locally, trading peak throughput for near-zero setup effort.
  • TGI offers production-grade serving with built-in quantization, metrics, and tracing for teams on the Hugging Face stack.

For more detailed information on model optimization and quantization, refer to our quantization guide.


