
Tencent Covo-Audio: Open-Source 7B Speech AI That Hears and Talks

Tencent released Covo-Audio, a 7B-parameter model that processes audio input and generates audio output within a single architecture. No separate ASR or TTS pipeline needed.

March 16, 2026·6 min read·1,287 words

In short: Tencent Covo-Audio is a 7B open-source (CC BY 4.0) model that handles speech recognition, reasoning, and synthesis in one architecture. It scores 75.30% on MMAU, the highest among 7B audio models tested, and the Chat-FD variant reports 99.7% turn-taking accuracy. It currently focuses on Chinese and English.

Tencent released Covo-Audio — a 7B-parameter model that processes audio input and generates audio output within a single architecture. No separate ASR, LLM, and TTS pipeline. One model handles speech recognition, reasoning, and voice synthesis end-to-end.

It scores 75.30% on the MMAU benchmark, the highest among all 7B-scale audio models tested. The full-duplex variant handles real-time conversation with 99.7% turn-taking accuracy. Everything is open-source under CC BY 4.0.


Why Covo-Audio Matters

Traditional voice AI stacks chain together three separate models: a speech-to-text model, a language model for reasoning, and a text-to-speech model for output. Each handoff loses information — tone, emphasis, emotion. Errors compound across the chain.

Covo-Audio avoids this by processing audio natively. The model hears continuous audio, reasons about it, and speaks back, all within one model rather than three. In principle, that means lower latency, fewer compounding errors, and better preservation of acoustic nuance.

For developers building voice assistants, call center automation, or real-time translation tools, this is a meaningful shift from the cascade approach.


Architecture

Covo-Audio combines four components into a unified system:

| Component | Model | Role |
|---|---|---|
| Language backbone | Qwen2.5-7B | Text reasoning and generation |
| Audio encoder | Whisper-large-v3 (50 Hz) | Processes incoming audio |
| Speech tokenizer | WavLM-large (25 Hz, 16K codebook) | Converts speech to discrete tokens |
| Audio decoder | BigVGAN (24 kHz) | Generates audio output |
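
The rates in the component table imply some simple sequence-length arithmetic. The sketch below works it out for a clip of a given duration. It assumes "16K codebook" means 2**14 = 16,384 entries, which the source does not state, and the function name `rates_for_clip` is made up for this illustration.

```python
import math

# Rates taken from the component table.
ENCODER_HZ = 50          # Whisper-large-v3 frames per second
TOKENIZER_HZ = 25        # WavLM-large speech tokens per second
DECODER_SR = 24_000      # BigVGAN output samples per second
CODEBOOK_SIZE = 16_384   # assumption: "16K" read as 2**14 entries


def rates_for_clip(seconds: float) -> dict:
    """Return the per-clip counts implied by the published rates."""
    bits_per_token = math.log2(CODEBOOK_SIZE)
    return {
        "encoder_frames": round(seconds * ENCODER_HZ),
        "speech_tokens": round(seconds * TOKENIZER_HZ),
        "output_samples": round(seconds * DECODER_SR),
        "samples_per_token": DECODER_SR // TOKENIZER_HZ,
        "bits_per_token": bits_per_token,
        "token_bitrate_bps": TOKENIZER_HZ * bits_per_token,
    }


for name, value in rates_for_clip(10.0).items():
    print(f"{name}: {value}")
```

For a 10-second clip this gives 500 encoder frames, 250 speech tokens, and 240,000 output samples, so each speech token stands for 960 samples of generated audio.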

The key technique is Hierarchical Tri-modal Interleaving — the model aligns continuous audio features, discrete speech tokens, and natural language text at both phrase and sentence levels. This preserves both semantic meaning and prosodic detail (pitch, rhythm, emphasis).
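
To make "interleaving at phrase and sentence levels" concrete, here is a toy sketch of the general idea. It is not Tencent's implementation: the `Phrase` structure, the token IDs, and the ordering (text first, then speech) are all hypothetical, and the continuous audio features are left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phrase:
    text: List[str]    # text tokens for one phrase
    speech: List[int]  # hypothetical discrete speech token IDs for the same phrase


def interleave(units: List[Phrase]) -> List[object]:
    """Emit each unit's text tokens followed by its speech tokens."""
    sequence: List[object] = []
    for unit in units:
        sequence.extend(unit.text)
        sequence.extend(unit.speech)
    return sequence


def merge(phrases: List[Phrase]) -> Phrase:
    """Collapse several phrases into one sentence-sized unit."""
    text = [tok for p in phrases for tok in p.text]
    speech = [tok for p in phrases for tok in p.speech]
    return Phrase(text, speech)


sentence = [
    Phrase(["good", "morning"], [101, 102, 103]),
    Phrase(["how", "are", "you"], [104, 105]),
]

phrase_level = interleave(sentence)             # modalities alternate per phrase
sentence_level = interleave([merge(sentence)])  # one text span, then one speech span
print(phrase_level)
print(sentence_level)
```

The two outputs contain the same tokens in different orders. The phrase-level sequence switches modality more often, which is one plausible way a model could learn fine-grained alignment between words and their acoustic realization.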

An intelligence-speaker decoupling approach separates what the model says from how it sounds. Through multi-speaker training, Covo-Audio adapts to different voice characteristics without retraining.


Model Variants

| Variant | Use Case | Full-Duplex |
|---|---|---|
| Covo-Audio | Foundation model, embeddings, research | No |
| Covo-Audio-Chat | Dialogue, Q&A, voice assistants | No |
| Covo-Audio-Chat-FD | Real-time conversation, interruption handling | Yes |

The Chat-FD variant is the most relevant for real-time production use. It supports full-duplex interaction: the model can listen while speaking, handle interruptions, and manage turn-taking without explicit voice activity detection.


Benchmarks

Audio Understanding (MMAU-v05.15.25)

| Model | Params | Sound | Music | Speech | Average |
|---|---|---|---|---|---|
| Covo-Audio | 7B | 78.68% | 76.05% | 71.17% | 75.30% |
| Qwen2.5-Omni | 7B | | | | 71.50% |
| Step-Audio 2 | 32B | | | | 77.58% |

Covo-Audio beats Qwen2.5-Omni by 3.8 points at the same scale. It comes within 2.3 points of Step-Audio 2, which has roughly 4.6x the parameters (32B vs. 7B).
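
The gaps quoted here follow directly from the table. A short sketch of the arithmetic, using only the MMAU averages and parameter counts listed above (the `compare` helper is an illustrative name):

```python
# (parameters in billions, MMAU average in percent), as quoted in the table.
scores = {
    "Covo-Audio": (7, 75.30),
    "Qwen2.5-Omni": (7, 71.50),
    "Step-Audio 2": (32, 77.58),
}

base_params, base_score = scores["Covo-Audio"]


def compare(name: str) -> tuple:
    """Return (score gap in points, parameter ratio) relative to Covo-Audio."""
    params, score = scores[name]
    return round(base_score - score, 2), round(params / base_params, 2)


for other in ("Qwen2.5-Omni", "Step-Audio 2"):
    gap, ratio = compare(other)
    print(f"{other}: gap {gap:+.2f} points, {ratio:.2f}x parameters")
```

A positive gap means Covo-Audio scores higher. The ratio column puts the Step-Audio 2 result in context: the 2.28-point deficit comes against a model about 4.57 times larger.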

Full-Duplex Conversation (MMSU)

| Metric | Score |
|---|---|
| Turn-taking success | 99.7% |
| Pause handling | 97.6% |
| Interruption handling | 96.81% |
| Backchanneling | 93.89% |
| Overall (MMSU average) | 66.64% |

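Tencent reports the 66.64% MMSU score as the highest among the models it evaluated, including closed-source systems. Note that 66.64% is not the mean of the four full-duplex metrics above, which average about 97.0%, so it appears to be a separate overall score rather than a summary of those rows.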

Speech Recognition (ASR)

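| Test Set | Word Error Rate |
|---|---|
| LibriSpeech clean | 1.45% |
| LibriSpeech other | 3.21% |
| Average | 4.71% |

The 4.71% average is higher than either LibriSpeech figure, so it presumably spans more test sets than the two listed here.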
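
For readers new to the metric: word error rate is the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. A minimal sketch follows. It splits on whitespace only, whereas real scoring pipelines usually normalize case and punctuation first, and it is not the evaluation code Tencent used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j]: edit distance between the reference words consumed so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The example has one substitution ("sat" to "sit") and one deletion (the second "the") against six reference words, so the WER is 2/6, or about 33%. A 1.45% WER therefore means roughly one word-level error per 69 reference words.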

Speech Translation (CoVoST2)

| Direction | BLEU Score |
|---|---|
| English → Chinese | 49.84 |
| Chinese → English | 26.77 |

Hardware Requirements

*Disclosure: some links in this section are affiliate links. If you buy through them, ToolHalla may earn a commission at no extra cost to you.*

Tencent hasn't published official VRAM requirements, so the figures below are estimates based on the 7B architecture, not measured numbers:

  • BF16 (16-bit, unquantized): ~16 GB VRAM, a tight fit on an RTX 4080 16GB or RTX 5080 16GB
  • INT8 quantized: ~8-10 GB, fits on an RTX 4060 Ti 16GB
  • INT4 quantized: ~5-6 GB, should fit on an RTX 3060 12GB

For real-time full-duplex inference, expect to need additional headroom. A 16 GB+ card is a sensible minimum for production-quality results until Tencent publishes official numbers.
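
The estimates above come from a common back-of-the-envelope rule: parameter count times bytes per parameter, plus an allowance for activations, KV cache, and audio buffers. A sketch of that calculation, assuming a flat 7 billion parameters and a 20% overhead factor; both are rough assumptions, not measured values for Covo-Audio.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


def estimated_vram_gb(params_billions: float, dtype: str, overhead: float = 0.2) -> float:
    """Weights plus a fractional allowance for activations, KV cache, and buffers.

    The 20% default is a hypothetical rule of thumb, not a measured figure.
    """
    return weight_memory_gb(params_billions, dtype) * (1.0 + overhead)


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: weights {weight_memory_gb(7, dtype):.1f} GB, "
          f"with overhead {estimated_vram_gb(7, dtype):.1f} GB")
```

Under these assumptions BF16 comes to 14 GB of weights and about 16.8 GB in total, which is why a 16 GB card is borderline at full 16-bit precision. Long conversations and full-duplex streaming grow the KV cache, so real usage can exceed the flat overhead used here.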

If you don't have local GPU capacity, cloud GPU providers like Vast.ai rent RTX 4090s on demand — practical for testing or running a voice service without buying hardware.

For help choosing a card, see our best GPUs for running AI locally guide.


Getting Started

Installation


conda create -n covoaudio python=3.11
conda activate covoaudio
git clone https://github.com/Tencent/Covo-Audio.git
cd Covo-Audio
pip install -r requirements.txt

Download the Model


pip install huggingface-hub
huggingface-cli download tencent/Covo-Audio-Chat --local-dir ./covoaudio

For the full-duplex variant:


huggingface-cli download tencent/Covo-Audio-Chat-FD --local-dir ./covoaudio-fd
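
The same downloads can be scripted from Python with `snapshot_download` from the `huggingface-hub` package installed above. A minimal sketch: the repository IDs and directories are copied from the CLI commands, while the `VARIANTS` mapping and the `download_variant` helper are illustrative names, not part of the Covo-Audio repo.

```python
from pathlib import Path

# Repository IDs and target directories copied from the CLI commands above.
VARIANTS = {
    "chat": ("tencent/Covo-Audio-Chat", "./covoaudio"),
    "chat-fd": ("tencent/Covo-Audio-Chat-FD", "./covoaudio-fd"),
}


def download_variant(name: str) -> Path:
    """Download one variant's files and return the local directory."""
    # Imported lazily so this module loads even if huggingface-hub is not installed.
    from huggingface_hub import snapshot_download

    repo_id, local_dir = VARIANTS[name]
    path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return Path(path)


# Example (needs network access and enough disk space for the weights):
# model_dir = download_variant("chat")
```

Keeping the two variants in one mapping makes it easy to switch between the turn-based and full-duplex models in a deployment script without retyping repository names.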

Run Inference

Check the GitHub repository for inference scripts and example code. The project includes examples for speech recognition, dialogue, and full-duplex conversation. Exact script names and arguments are documented in the repo's README.


Covo-Audio vs GPT-4o Audio vs Gemini 3.1 Flash Live

| Feature | Covo-Audio | GPT-4o Audio | Gemini 3.1 Flash Live |
|---|---|---|---|
| Parameters | 7B | Undisclosed | Undisclosed |
| Open source | Yes (CC BY 4.0) | No | No |
| Runs locally | Yes | No | No |
| Full-duplex | Yes (Chat-FD) | Yes | Yes |
| MMAU score | 75.30% | | |
| ASR (WER) | 4.71% avg | | |
| Commercial use | Yes | Via subscription | Via API billing |
| Languages | Chinese, English | Many | Many |

Key trade-off: GPT-4o and Gemini support far more languages and have large training budgets behind them. Covo-Audio is open, runs locally, and posts competitive accuracy at 7B scale — but currently focuses on Chinese and English. If you want a closer look at the hosted real-time option, see our write-up on Gemini 3.1 Flash Live. For a local text-to-speech alternative, compare Voxtral TTS 4B.


Who Should Use Covo-Audio

  • Voice app developers who want to avoid per-minute API costs from hosted providers
  • Researchers studying end-to-end speech models — the CC BY 4.0 license allows modification and redistribution
  • Teams building Chinese/English voice products where data privacy matters and cloud APIs aren't acceptable
  • Hobbyists experimenting with local voice AI on consumer GPUs

If you need many languages or the most polished conversational experience, GPT-4o Audio and Gemini 3.1 Flash Live are still ahead. If you want open weights, local control, and strong benchmarks at 7B scale, Covo-Audio is one of the better options available today.



Frequently Asked Questions

What makes Covo-Audio different from traditional voice AI stacks?

Covo-Audio processes audio natively in one model instead of chaining a separate speech-to-text model, language model, and text-to-speech model. That avoids the information loss and compounding errors that happen at each handoff in a cascade pipeline.

How does Covo-Audio perform in real-time conversation?

Tencent reports 99.7% turn-taking success, 97.6% pause handling, and 96.81% interruption handling on the MMSU full-duplex benchmark, with an overall MMSU average of 66.64%. The Covo-Audio-Chat-FD variant is the one built for this full-duplex interaction.

Is Covo-Audio suitable for voice assistants or call center automation?

It is a reasonable fit for those use cases because of its unified architecture and full-duplex variant, with the main caveat that it currently focuses on Chinese and English. Teams needing broad multilingual coverage may still prefer a hosted provider.

What is the license for Covo-Audio?

Covo-Audio is open-source under CC BY 4.0, which allows use, modification, and redistribution with attribution.

What are the alternatives to Covo-Audio?

For hosted, closed-source real-time voice, GPT-4o Audio and Gemini 3.1 Flash Live support more languages. At the same 7B scale, Qwen2.5-Omni is the closest open comparison (Covo-Audio scores higher on MMAU). For local text-to-speech specifically, see Voxtral TTS 4B.

What does Covo-Audio cost to run?

The model weights are free under CC BY 4.0, so the cost is the GPU you run it on — local hardware or a rented cloud GPU. Tencent has not published official VRAM requirements, so size your GPU conservatively until it does.


*Disclosure: this article contains affiliate links. If you buy through the Amazon links or sign up via the Vast.ai referral link, ToolHalla may earn a commission at no extra cost to you. Need a GPU to run Covo-Audio? See our best GPUs for running AI locally guide or browse RTX 5080 listings on Amazon.*

