Tencent Covo-Audio: Open-Source 7B Speech AI That Hears and Talks
In short: Tencent Covo-Audio is a 7B open-source (CC BY 4.0) model that handles speech recognition, reasoning, and synthesis in one architecture. It scores 75.30% on MMAU, the highest among 7B audio models tested, and the Chat-FD variant reports 99.7% turn-taking accuracy. It currently focuses on Chinese and English.
Tencent released Covo-Audio — a 7B-parameter model that processes audio input and generates audio output within a single architecture. No separate ASR, LLM, and TTS pipeline. One model handles speech recognition, reasoning, and voice synthesis end-to-end.
It scores 75.30% on the MMAU benchmark, the highest among all 7B-scale audio models tested. The full-duplex variant handles real-time conversation with 99.7% turn-taking accuracy. Everything is open-source under CC BY 4.0.
Why Covo-Audio Matters
Traditional voice AI stacks chain together three separate models: a speech-to-text model, a language model for reasoning, and a text-to-speech model for output. Each handoff loses information — tone, emphasis, emotion. Errors compound across the chain.
Covo-Audio eliminates this by processing audio natively. The model hears continuous audio, reasons about it, and speaks back, all within a single model rather than three chained ones. In principle, that means lower latency, fewer compounding errors, and better preservation of acoustic nuance.
For developers building voice assistants, call center automation, or real-time translation tools, this is a meaningful shift from the cascade approach.
Architecture
Covo-Audio combines four components into a unified system:
| Component | Model | Role |
|---|---|---|
| Language backbone | Qwen2.5-7B | Text reasoning and generation |
| Audio encoder | Whisper-large-v3 (50 Hz) | Processes incoming audio |
| Speech tokenizer | WavLM-large (25 Hz, 16K codebook) | Converts speech to discrete tokens |
| Audio decoder | BigVGAN (24 kHz) | Generates audio output |
The key technique is Hierarchical Tri-modal Interleaving — the model aligns continuous audio features, discrete speech tokens, and natural language text at both phrase and sentence levels. This preserves both semantic meaning and prosodic detail (pitch, rhythm, emphasis).
An intelligence-speaker decoupling approach separates what the model says from how it sounds. Through multi-speaker training, Covo-Audio adapts to different voice characteristics without retraining.
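To make the rates in the component table concrete, here is a small arithmetic sketch. It uses only the three rates listed there (50 Hz, 25 Hz, 24 kHz). The assumption that each speech token corresponds to a fixed span of output samples is mine for illustration, not something the article confirms about Covo-Audio's decoder.

```python
# Rates copied from the component table; everything below is plain arithmetic.
ENCODER_HZ = 50      # Whisper-large-v3 feature frames per second
TOKENIZER_HZ = 25    # WavLM-large speech tokens per second
OUTPUT_SR = 24_000   # BigVGAN output sample rate (samples per second)


def rates_for(seconds: float) -> dict:
    """Return how many frames, tokens, and samples span `seconds` of audio."""
    return {
        "encoder_frames": round(seconds * ENCODER_HZ),
        "speech_tokens": round(seconds * TOKENIZER_HZ),
        "output_samples": round(seconds * OUTPUT_SR),
    }


# Illustrative ratios (assumes a fixed, uniform alignment between streams).
frames_per_token = ENCODER_HZ // TOKENIZER_HZ    # encoder frames per speech token
samples_per_token = OUTPUT_SR // TOKENIZER_HZ    # output samples per speech token

ten_seconds = rates_for(10.0)
print(frames_per_token, samples_per_token, ten_seconds)
```

Under these assumptions, ten seconds of speech is 500 encoder frames and 250 speech tokens, which is why the discrete token stream is much cheaper for the language backbone to model than raw audio samples.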
Model Variants
| Variant | Use Case | Full-Duplex |
|---|---|---|
| Covo-Audio | Foundation model, embeddings, research | No |
| Covo-Audio-Chat | Dialogue, Q&A, voice assistants | No |
| Covo-Audio-Chat-FD | Real-time conversation, interruption handling | Yes |
The Chat-FD variant is the most relevant for production use. It supports true full-duplex interaction — the model can listen while speaking, handle interruptions, and manage turn-taking without explicit voice activity detection.
Benchmarks
Audio Understanding (MMAU-v05.15.25)
| Model | Params | Sound | Music | Speech | Average |
|---|---|---|---|---|---|
| Covo-Audio | 7B | 78.68% | 76.05% | 71.17% | 75.30% |
| Qwen2.5-Omni | 7B | — | — | — | 71.50% |
| Step-Audio 2 | 32B | — | — | — | 77.58% |
Covo-Audio beats Qwen2.5-Omni by 3.8 points at the same scale. It comes within 2.3 points of Step-Audio 2, which has roughly 4.6x the parameters.
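The comparison can be reproduced from the table with a few lines of arithmetic. The sketch below uses only the numbers shown there. It also checks that the 75.30% average is the unweighted mean of the three category scores, which is an inference from the numbers rather than a documented scoring rule.

```python
# Scores copied from the MMAU table above.
covo = {"sound": 78.68, "music": 76.05, "speech": 71.17}
qwen_avg = 71.50   # Qwen2.5-Omni, 7B
step_avg = 77.58   # Step-Audio 2, 32B

# Unweighted mean of the three categories.
covo_avg = round(sum(covo.values()) / len(covo), 2)

gap_to_qwen = round(covo_avg - qwen_avg, 2)   # lead over the same-scale model
gap_to_step = round(step_avg - covo_avg, 2)   # deficit to the larger model
param_ratio = round(32 / 7, 2)                # Step-Audio 2 size relative to 7B

print(covo_avg, gap_to_qwen, gap_to_step, param_ratio)
```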
Full-Duplex Conversation (MMSU)
| Metric | Score |
|---|---|
| Turn-taking success | 99.7% |
| Pause handling | 97.6% |
| Interruption handling | 96.81% |
| Backchanneling | 93.89% |
| Overall (MMSU average) | 66.64% |
Tencent reports the 66.64% MMSU score as the highest among the models it evaluated, including closed-source systems.
Speech Recognition (ASR)
| Test Set | Word Error Rate |
|---|---|
| LibriSpeech clean | 1.45% |
| LibriSpeech other | 3.21% |
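| Average (as reported) | 4.71% |

The 4.71% average is higher than both LibriSpeech rows (their mean would be 2.33%), so it evidently covers additional test sets that are not listed here.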
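For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. The function below is a minimal sketch of that standard definition. It is not Tencent's evaluation code, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and hyp[:j]; start with the empty reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{example:.2%}")
```

A WER of 1.45% therefore means roughly one or two word-level mistakes per hundred reference words.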
Speech Translation (CoVoST2)
| Direction | BLEU Score |
|---|---|
| English → Chinese | 49.84 |
| Chinese → English | 26.77 |
Hardware Requirements
*Disclosure: some links in this section are affiliate links. If you buy through them, ToolHalla may earn a commission at no extra cost to you.*
Tencent hasn't published official VRAM requirements, so the figures below are estimates based on the 7B architecture, not measured numbers:
- BF16 (unquantized, 16-bit): ~16 GB VRAM. The weights alone are about 14 GB, so an RTX 4080 16GB or RTX 5080 16GB is a tight fit
- INT8 quantized: ~8-10 GB, which fits on an RTX 4060 Ti 16GB
- INT4 quantized: ~5-6 GB, which should fit on an RTX 3060 12GB with room left for activations and cache
For real-time full-duplex inference, expect to need additional headroom. A 16 GB+ card is a sensible minimum for production-quality results until Tencent publishes official numbers.
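As a rough cross-check on these estimates, weight memory can be approximated as parameter count times bytes per parameter. The sketch below assumes a nominal 7 billion parameters and counts weights only. It ignores activations, the KV cache, and any extra parameters in the audio encoder and decoder, so real usage will be higher. The function name is illustrative.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


# Bit widths for the three precisions discussed above.
BITS = {"BF16": 16, "INT8": 8, "INT4": 4}

# Assumption: a nominal 7B parameter count.
weights = {name: weight_gb(7, bits) for name, bits in BITS.items()}

for name, gb in weights.items():
    print(f"{name}: ~{gb:.1f} GB for weights, before runtime overhead")
```

The gap between these weight-only figures and the estimates in the list is the allowance for runtime overhead, which grows with context length and with full-duplex streaming.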
If you don't have local GPU capacity, cloud GPU providers like Vast.ai rent RTX 4090s on demand — practical for testing or running a voice service without buying hardware.
For help choosing a card, see our best GPUs for running AI locally guide.
Getting Started
Installation
```bash
conda create -n covoaudio python=3.11
conda activate covoaudio
git clone https://github.com/Tencent/Covo-Audio.git
cd Covo-Audio
pip install -r requirements.txt
```
Download the Model
```bash
pip install huggingface-hub
huggingface-cli download tencent/Covo-Audio-Chat --local-dir ./covoaudio
```
For the full-duplex variant:
```bash
huggingface-cli download tencent/Covo-Audio-Chat-FD --local-dir ./covoaudio-fd
```
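The same downloads can also be scripted from Python with `snapshot_download` from the `huggingface_hub` package. This is a minimal sketch. The repository ids and local directories are copied from the CLI commands above, while the `VARIANTS` mapping and `download_variant` helper are hypothetical names introduced here, not part of the Covo-Audio repo.

```python
# Repo ids and target directories copied from the CLI commands above.
# The short keys ("chat", "chat-fd") are illustrative labels.
VARIANTS = {
    "chat": ("tencent/Covo-Audio-Chat", "./covoaudio"),
    "chat-fd": ("tencent/Covo-Audio-Chat-FD", "./covoaudio-fd"),
}


def download_variant(name: str) -> str:
    """Download one model variant and return the local snapshot path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id, local_dir = VARIANTS[name]
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_variant("chat-fd")` fetches the full-duplex weights. Nothing is downloaded until the function is called, so the mapping can be reused in setup scripts.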
Run Inference
Check the GitHub repository for inference scripts and example code. The project includes examples for speech recognition, dialogue, and full-duplex conversation. Exact script names and arguments are documented in the repo's README.
Covo-Audio vs GPT-4o Audio vs Gemini 3.1 Flash Live
| Feature | Covo-Audio | GPT-4o Audio | Gemini 3.1 Flash Live |
|---|---|---|---|
| Parameters | 7B | Undisclosed | Undisclosed |
| Open source | Yes (CC BY 4.0) | No | No |
| Runs locally | Yes | No | No |
| Full-duplex | Yes (Chat-FD) | Yes | Yes |
| MMAU score | 75.30% | — | — |
| ASR (WER) | 4.71% avg | — | — |
| Commercial use | Yes | Via subscription | Via API billing |
| Languages | Chinese, English | Many | Many |
Key trade-off: GPT-4o and Gemini support far more languages and have large training budgets behind them. Covo-Audio is open, runs locally, and posts competitive accuracy at 7B scale — but currently focuses on Chinese and English. If you want a closer look at the hosted real-time option, see our write-up on Gemini 3.1 Flash Live. For a local text-to-speech alternative, compare Voxtral TTS 4B.
Who Should Use Covo-Audio
- Voice app developers who want to avoid per-minute API costs from hosted providers
- Researchers studying end-to-end speech models — the CC BY 4.0 license allows modification and redistribution
- Teams building Chinese/English voice products where data privacy matters and cloud APIs aren't acceptable
- Hobbyists experimenting with local voice AI on consumer GPUs
If you need many languages or the most polished conversational experience, GPT-4o Audio or Gemini Flash Live are still ahead. If you want open weights, local control, and strong benchmarks at 7B scale, Covo-Audio is one of the better options available today.
Frequently Asked Questions
What makes Covo-Audio different from traditional voice AI stacks?
Covo-Audio processes audio natively in one model instead of chaining a separate speech-to-text model, language model, and text-to-speech model. That avoids the information loss and compounding errors that happen at each handoff in a cascade pipeline.
How does Covo-Audio perform in real-time conversation?
Tencent reports 99.7% turn-taking success, 97.6% pause handling, and 96.81% interruption handling on the MMSU full-duplex benchmark, with an overall MMSU average of 66.64%. The Covo-Audio-Chat-FD variant is the one built for this full-duplex interaction.
Is Covo-Audio suitable for voice assistants or call center automation?
It is a reasonable fit for those use cases because of its unified architecture and full-duplex variant, with the main caveat that it currently focuses on Chinese and English. Teams needing broad multilingual coverage may still prefer a hosted provider.
What is the license for Covo-Audio?
Covo-Audio is open-source under CC BY 4.0, which allows use, modification, and redistribution with attribution.
What are the alternatives to Covo-Audio?
For hosted, closed-source real-time voice, GPT-4o Audio and Gemini 3.1 Flash Live support more languages. At the same 7B scale, Qwen2.5-Omni is the closest open comparison (Covo-Audio scores higher on MMAU). For local text-to-speech specifically, see Voxtral TTS 4B.
What does Covo-Audio cost to run?
The model weights are free under CC BY 4.0, so the cost is the GPU you run it on — local hardware or a rented cloud GPU. Tencent has not published official VRAM requirements, so size your GPU conservatively until it does.