
March 16, 2026·8 min read·1,685 words

Mistral released Voxtral TTS on March 26, 2026 — a 4-billion parameter text-to-speech model with open weights on Hugging Face. It supports 9 languages, can clone a voice from a 3-second clip, and runs on a single GPU with 16 GB of VRAM.

In human evaluations, Voxtral TTS matched ElevenLabs v3 on speech quality and beat ElevenLabs Flash v2.5 on naturalness. The difference: you can run it on your own hardware.

What Makes Voxtral TTS Different

Most competitive TTS systems — ElevenLabs, Play.ht, OpenAI TTS — are API-only. You send text to their servers and pay per character. Voxtral TTS gives you the weights directly.

Here's what that means in practice:

  • No per-character costs. After setup, inference is free.
  • Data stays local. Sensitive text (medical reports, internal docs, customer data) never leaves your machine.
  • Voice cloning without upload. Clone any voice from a 3-second reference clip, entirely offline.
  • Low latency. Time-to-first-audio is competitive with cloud APIs when running locally.

The model weights are available on Hugging Face under a CC BY-NC 4.0 license. For commercial use, Mistral offers an API at $0.016 per 1,000 characters.

Hardware Requirements

Voxtral TTS 4B is genuinely lightweight for a model that competes with cloud TTS:

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 16 GB | 24 GB |
| System RAM | 16 GB | 32 GB |
| Storage | 10 GB | 10 GB |
| GPU | RTX 4060 Ti 16 GB | RTX 4090 / RTX 5070 Ti |

If you already have a setup for running 7B language models locally, you have enough hardware for Voxtral TTS. Check our GPU buying guide if you're planning a hardware upgrade.

How to Run Voxtral TTS Locally

Voxtral TTS currently runs through vLLM-Omni — not Ollama or llama.cpp. Mistral built Voxtral on their own architecture, and the community hasn't ported it to other runtimes yet. If you're choosing between inference servers, see our vLLM vs Ollama vs TGI comparison.

Step 1: Install vLLM-Omni


pip install "vllm>=0.18.0"
pip install git+https://github.com/vllm-project/vllm-omni.git --upgrade
pip install mistral-common

Step 2: Start the Server


vllm serve mistralai/Voxtral-4B-TTS-2603 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --max-model-len 8192
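Loading the weights can take a minute or more on first start. Before sending requests, you can poll the server's OpenAI-compatible model list (vLLM exposes /v1/models) until it answers. A minimal sketch using only the standard library; `wait_for_server` is an illustrative helper, not part of vLLM:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://localhost:8000/v1/models",
                    timeout_s=300.0, poll_s=2.0):
    """Poll the server's model-listing endpoint until it responds with 200."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(poll_s)
    return False
```

Call `wait_for_server()` once after launching the server; it returns True as soon as the endpoint is reachable, or False if the timeout expires.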

Step 3: Generate Speech


import base64

import httpx

response = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "messages": [
            {
                "role": "user",
                "content": "Say the following text naturally: Welcome to ToolHalla, your guide to running AI locally."
            }
        ],
        "modalities": ["audio"],
        "audio": {"voice": "jessica", "format": "wav"}
    },
    timeout=120.0,  # audio generation can exceed httpx's 5-second default
)

# The audio comes back base64-encoded; decode it and write it to disk.
audio_data = response.json()["choices"][0]["message"]["audio"]["data"]

with open("output.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))
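To turn a longer script into per-line clips, the same request can be issued in a loop. This sketch takes a `post` callable that returns the parsed JSON response, so it works with httpx, requests, or a test stub; `synthesize_lines` and the segment naming are illustrative, not part of any official client:

```python
import base64
from pathlib import Path

def synthesize_lines(lines, post, out_dir="voiceover"):
    """Send each script line to the TTS endpoint via `post` and save the
    decoded audio as segment_000.wav, segment_001.wav, ... in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, line in enumerate(lines):
        resp = post(json={
            "model": "mistralai/Voxtral-4B-TTS-2603",
            "messages": [{"role": "user",
                          "content": f"Say the following text naturally: {line}"}],
            "modalities": ["audio"],
            "audio": {"voice": "jessica", "format": "wav"},
        })
        audio_b64 = resp["choices"][0]["message"]["audio"]["data"]
        path = out / f"segment_{i:03d}.wav"
        path.write_bytes(base64.b64decode(audio_b64))
        paths.append(str(path))
    return paths
```

With httpx you would pass something like `post=lambda json: httpx.post(url, json=json, timeout=120.0).json()`.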

Built-in Voices

Voxtral TTS ships with several reference voices. You can also provide your own 3-second audio clip for voice cloning:


# Voice cloning with a reference clip
"audio": {
    "voice": "custom",
    "reference_audio": "/path/to/reference.wav",
    "format": "wav"
}
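Putting the two request shapes together, a small helper can assemble the body for either a built-in voice or a cloned one. `build_tts_request` is a hypothetical convenience function; the field names mirror the examples above, and whether the server accepts a local file path or expects the clip base64-encoded may vary, so check the vLLM-Omni documentation:

```python
def build_tts_request(text, voice="jessica", reference_audio=None, fmt="wav"):
    """Assemble a /v1/chat/completions request body for Voxtral TTS.

    If reference_audio is given, switch to the "custom" voice and attach
    the clip, mirroring the cloning snippet above."""
    if reference_audio is not None:
        audio = {"voice": "custom", "reference_audio": reference_audio,
                 "format": fmt}
    else:
        audio = {"voice": voice, "format": fmt}
    return {
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "messages": [{"role": "user",
                      "content": f"Say the following text naturally: {text}"}],
        "modalities": ["audio"],
        "audio": audio,
    }
```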

Supported Languages

Voxtral TTS supports 9 languages out of the box:

| Language | Quality |
| --- | --- |
| English | Excellent |
| French | Excellent |
| German | Good |
| Spanish | Good |
| Portuguese | Good |
| Italian | Good |
| Dutch | Good |
| Hindi | Moderate |
| Arabic | Moderate |

The model handles code-switching (mixing languages mid-sentence) reasonably well, which matters for multilingual content or technical narration that includes English terms.

How It Compares to ElevenLabs

This is the comparison most people care about. Here's how the numbers break down based on Mistral's published evaluations:

| Metric | Voxtral TTS 4B | ElevenLabs Flash v2.5 | ElevenLabs v3 |
| --- | --- | --- | --- |
| Naturalness | Higher | Lower | Similar |
| Time-to-first-audio | Similar | Similar | Slower |
| Voice cloning quality | Good (3 s ref) | Good (30 s ref) | Excellent (30 s ref) |
| Cost (1M chars) | $0 (local) / $16 (API) | ~$30 | ~$120 |
| Languages | 9 | 29 | 29 |
| Runs locally | Yes | No | No |

The tradeoff is clear: ElevenLabs supports more languages and has better voice cloning with longer reference clips. Voxtral wins on cost and data privacy.

For a broader TTS landscape comparison, see our full TTS platform review.

Use Cases That Make Sense Locally

Running TTS locally isn't always worth the setup. Here's where it pays off:

Content creation. Generate voiceovers for YouTube videos, podcasts, or courses without per-character billing. At production volumes (100k+ characters/day), local inference saves hundreds of dollars monthly.
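The savings claim is easy to sanity-check with the prices quoted in this article (Voxtral API at $0.016 per 1,000 characters, ElevenLabs Flash at roughly $30 and v3 at roughly $120 per million characters):

```python
chars_per_month = 100_000 * 30  # 100k characters/day over a 30-day month

voxtral_api = chars_per_month / 1_000 * 0.016        # about $48/month
elevenlabs_flash = chars_per_month / 1_000_000 * 30  # about $90/month
elevenlabs_v3 = chars_per_month / 1_000_000 * 120    # about $360/month
local = 0.0  # marginal cost once the hardware is paid for

print(round(voxtral_api), round(elevenlabs_flash), round(elevenlabs_v3))
```

Against v3 pricing, local inference at this volume saves a few hundred dollars per month, which is the claim above.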

Voice AI agents. If you're building voice assistants or customer-facing bots, local TTS eliminates a network hop and keeps response times predictable. This pairs well with Tencent's Covo-Audio for the speech-to-text side.

Privacy-sensitive applications. Medical dictation, legal document narration, internal corporate tools — anywhere the text shouldn't leave your network.

Prototyping. Test voice interfaces without burning API credits during development. Switch to the Voxtral API ($0.016/1k chars) for production if you need managed infrastructure.

Current Limitations

  • No Ollama support yet. You need vLLM-Omni, which is more complex to set up than a simple ollama pull command. Community ports to llama.cpp and Transformers are in progress.
  • CC BY-NC 4.0 license. The open weights are non-commercial. If you're building a product, you need the paid API or a commercial license from Mistral.
  • 9 languages vs. 29+. If you need Japanese, Korean, Chinese, or other Asian languages, ElevenLabs or OpenAI TTS still cover more ground.
  • Voice cloning from 3 seconds. While impressive, longer reference clips (10-30 seconds) still produce more accurate clones on competing platforms.

What's Next

Voxtral TTS is Mistral's first dedicated speech model, and it lands at a strong baseline. The 4B parameter count means it's accessible on consumer hardware — something you can't say about most frontier models.

The real test will be community adoption. Once llama.cpp and Ollama add support, the barrier to entry drops significantly. Until then, the vLLM setup is straightforward if you're comfortable with Python.

If you're already running local AI workloads, adding Voxtral TTS to your stack is a practical upgrade. If you're new to local inference, start with our guide on how to run LLMs locally with Ollama and come back when the ecosystem catches up.


FAQ

Can I run Voxtral TTS on a Mac with Apple Silicon?

Not yet. vLLM-Omni currently requires a CUDA-compatible GPU. Mistral hasn't released a Metal-compatible version. If Apple Silicon is your target, check our best local LLMs for Mac for models that work today.

Is Voxtral TTS free for commercial use?

No. The open weights use a CC BY-NC 4.0 license (non-commercial). For commercial applications, use the Voxtral API at $0.016 per 1,000 characters, or contact Mistral for an enterprise license.

How does Voxtral TTS compare to OpenAI's TTS?

Voxtral TTS is open-weight and runs locally — OpenAI's TTS is API-only. On quality, Voxtral is competitive with OpenAI's standard voices. The main advantage is cost at scale and data privacy.

Will Ollama support Voxtral TTS?

Not at launch. The community has been invited to contribute ports to Transformers and llama.cpp, which would be prerequisites for Ollama support. Expect this within weeks to months based on community interest.

Step-by-Step Guide to Running Voxtral TTS Locally

Setting Up Your Environment

1. Install Required Software:

- Ensure you have Python 3.10 or later installed.

- Install vLLM, vLLM-Omni, and the mistral-common tokenizer package:

```bash
pip install "vllm>=0.18.0"
pip install git+https://github.com/vllm-project/vllm-omni.git --upgrade
pip install mistral-common
```

2. Download the Model:

- No manual download is needed: vllm serve fetches the Voxtral TTS weights from Hugging Face on first start and caches them locally. Budget roughly 10 GB of disk and extra time for the first run.

3. Start the Server:

- Point vllm serve at the model ID, using Mistral's tokenizer and weight formats:

```bash
vllm serve mistralai/Voxtral-4B-TTS-2603 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --max-model-len 8192
```

4. Generate Speech:

- Send text to the server's OpenAI-compatible /v1/chat/completions endpoint and save the returned audio. You can use curl or the Python script from "Step 3: Generate Speech" above.

Voice Cloning Example

1. Prepare a Reference Clip:

- Record a roughly 3-second clip of the voice you want to clone. Ensure the audio is clear and free of background noise.

2. Request Speech in the Cloned Voice:

- No transcription or fine-tuning is required; Voxtral TTS clones the voice at inference time. Set the voice to "custom" and attach the clip in the request's audio options:

```
"audio": {
    "voice": "custom",
    "reference_audio": "/path/to/reference.wav",
    "format": "wav"
}
```

- The rest of the request is identical to the standard generation example; the model conditions its output on the reference clip and speaks your text in that voice.

Benchmarks and Performance

  • Latency: Running Voxtral TTS locally on an RTX 4090 with 24 GB VRAM results in a time-to-first-audio of approximately 200 milliseconds, which is competitive with cloud APIs.
  • Speech Quality: In human evaluations, Voxtral TTS matched ElevenLabs v3 on speech quality and outperformed ElevenLabs Flash v2.5 on naturalness.
  • Resource Utilization: The model efficiently uses GPU resources, with peak VRAM usage around 18 GB during inference.
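To reproduce a latency figure like this on your own hardware, time repeated requests and look at the median rather than a single run. A generic sketch (wrap your request in a zero-argument callable; for non-streaming requests this measures total request latency, an upper bound on time-to-first-audio):

```python
import statistics
import time

def benchmark(fn, runs=10):
    """Call fn() repeatedly and report median and worst-case latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return {"median_ms": statistics.median(samples),
            "max_ms": max(samples)}

# Usage: benchmark(lambda: httpx.post(url, json=payload, timeout=120.0))
```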

Best GPUs for Running AI Locally

When choosing a GPU for running AI models like Voxtral TTS locally, consider the following options based on VRAM and performance:

1. NVIDIA RTX 4090 (24 GB VRAM):

- Pros: High VRAM capacity, excellent performance for large models.

- Cons: High cost.

- Use Case: Ideal for running large models like Voxtral TTS without performance bottlenecks.

2. NVIDIA RTX 5070 Ti (16 GB VRAM):

- Pros: Newer architecture than the 40-series with strong performance for its price.

- Cons: 16 GB of VRAM is only the minimum for Voxtral TTS, leaving little headroom.

- Use Case: A current-generation option when an RTX 4090 is out of budget.

3. NVIDIA RTX 4060 Ti (16 GB VRAM):

- Pros: Good balance between cost and performance.

- Cons: Lower VRAM might limit the size of models you can run.

- Use Case: Best for users who can run smaller models or have a tighter budget.

4. NVIDIA RTX 3090 (24 GB VRAM):

- Pros: High VRAM, proven performance for AI tasks.

- Cons: Older architecture, might not be as efficient as newer models.

- Use Case: Good option for users who already own this GPU and want to run Voxtral TTS.

5. NVIDIA RTX 3080 Ti (12 GB VRAM):

- Pros: Good performance, lower cost than RTX 3090.

- Cons: At 12 GB, it falls below the 16 GB VRAM minimum this article lists for Voxtral TTS.

- Use Case: Better suited to smaller models and other AI workloads than to Voxtral TTS.

Key Takeaways

  • Local Deployment: Running Voxtral TTS locally offers cost savings, data privacy, and low latency.
  • Hardware Requirements: An RTX 4060 Ti with 16 GB VRAM is the minimum requirement, while an RTX 4090 or RTX 5070 Ti is recommended for optimal performance.
  • Voice Cloning: Easily clone voices using a 3-second reference clip, enhancing the versatility of the model.
  • Performance: Competitive latency and speech quality make Voxtral TTS a strong choice for local TTS solutions.

For more information on selecting the right GPU for your AI needs, check out our GPU buying guide.

