CosyVoice is an open-source multilingual speech generation model focused on high-quality text-to-speech (TTS). It supports speech synthesis in multiple languages and provides features such as zero-shot voice generation, cross-lingual voice cloning, and fine-grained emotion control. CosyVoice 2.0 reduces pronunciation errors by 30%–50% compared with the previous version, and sound quality and prosodic naturalness are greatly improved, with the MOS score rising from 5.4 to 5.53. Through a simplified model architecture and optimized algorithms, it achieves low-latency streaming as well as non-streaming speech synthesis for real-time interaction scenarios. The project provides complete inference, training, and deployment support that developers can get started with easily, making it suitable for applications such as voice assistants, dubbing, and multilingual content creation.
Related service: CosyVoice, an open-source 3-second fast voice cloning project launched by Alibaba, with support for emotion control tags.
Feature List
- Zero-shot voice generation: generates speech that matches a target voice from a short audio sample, with no additional training required.
- Cross-lingual speech synthesis: supports multi-language speech generation with consistent timbre, suitable for global content creation.
- Fine-grained emotion control: supports adding laughter, pauses, and other expressive elements to generate more natural speech.
- Dialect and accent adjustment: generates speech in specific dialects (e.g., Sichuanese) or accents to enhance the localized experience.
- Streaming speech synthesis: supports low-latency streaming output with first-packet latency as low as 150 ms, suitable for real-time applications.
- Model training support: provides a complete workflow for both using pre-trained models and training from scratch, to meet developer needs.
- High-quality sound output: optimized sound quality and prosody, with a MOS score of 5.53, close to commercial level.
Usage Guide
Installation process
To use CosyVoice, you first need to install the necessary environment and dependencies. Below are detailed installation steps to ensure that users can quickly configure and run the project.
- Install Conda
  CosyVoice recommends using Conda to manage the environment. Visit https://docs.conda.io/en/latest/miniconda.html to download and install Miniconda, then create and activate a virtual environment:
  ```bash
  conda create -n cosyvoice python=3.10
  conda activate cosyvoice
  ```
- Install dependencies
  Use pip to install the Python packages required by the project. For network reasons, a domestic mirror source (e.g., Aliyun) is recommended:
  ```bash
  pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
  ```
  Install pynini (for text processing):
  ```bash
  conda install -y -c conda-forge pynini==2.1.5
  ```
- Resolve sox compatibility issues
  If you run into sox-related problems, install sox for your operating system:
  - Ubuntu:
    ```bash
    sudo apt-get install sox libsox-dev
    ```
  - CentOS:
    ```bash
    sudo yum install sox sox-devel
    ```
- Download pre-trained models
  CosyVoice provides several pre-trained models (e.g., CosyVoice2-0.5B, CosyVoice-300M, CosyVoice-300M-SFT, and CosyVoice-300M-Instruct). Download these models as described in the GitHub repository and place them in the pretrained_models directory. Also download the CosyVoice-ttsfrd resource to support text front-end processing (see the download sketch after this list).
- Clone the project code
  Clone the CosyVoice repository with Git and make sure the submodules are loaded correctly:
  ```bash
  git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
  cd CosyVoice
  git submodule update --init --recursive
  ```
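For reference, here is a minimal download sketch using ModelScope's snapshot_download. It assumes the models are hosted on ModelScope under the iic namespace; the exact model IDs are an assumption and should be checked against the repository README:

```python
# Minimal sketch: download pre-trained models into pretrained_models/.
# Assumes `pip install modelscope` and that the models are hosted on ModelScope
# under the `iic` namespace; verify the model IDs against the repository README.
from modelscope import snapshot_download

snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```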
Usage
CosyVoice provides a variety of functional modules, including zero-shot voice generation, cross-lingual synthesis, and instruction-based speech generation. Usage for each is described below.
Zero-shot voice generation
Zero-shot generation lets the user produce speech in a target timbre from a short audio sample. For example, to generate speech with a specific timbre:
```python
import sys
sys.path.append('third_party/Matcha-TTS')  # run from the repo root so the Matcha-TTS submodule is importable
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# Load the CosyVoice2-0.5B pre-trained model and a 16 kHz prompt clip carrying the target timbre
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# Synthesize the target text in the prompt speaker's voice and save each output
for i, j in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech_16k,
        stream=False)):
    torchaudio.save(f'zero_shot_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```
- Steps:
  1. Prepare a 16 kHz prompt audio file (e.g., zero_shot_prompt.wav).
  2. Call the inference_zero_shot method, passing in the target text, the prompt text, and the prompt audio.
  3. Save the generated speech files.
- Note: To reproduce the results shown on the official website, set text_frontend=False (see the sketch below).
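Continuing the example above, a minimal sketch of where that flag goes, assuming text_frontend is accepted as a keyword argument by the inference methods (verify against the repository README):

```python
# Hypothetical illustration: pass text_frontend=False to skip the text front-end
# (assumes inference_zero_shot accepts a text_frontend keyword argument).
for i, j in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech_16k,
        stream=False, text_frontend=False)):
    torchaudio.save(f'zero_shot_raw_text_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```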
Cross-lingual speech synthesis
Cross-lingual synthesis supports generating multilingual speech with a consistent timbre. For example, generating speech that includes laughter:
```python
# Fine-grained control: emotion tags such as [laughter] are embedded directly in the text
for i, j in enumerate(cosyvoice.inference_cross_lingual(
        '在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。',
        prompt_speech_16k,
        stream=False)):
    torchaudio.save(f'fine_grained_control_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```
- Steps:
  1. Prepare the prompt audio.
  2. Add emotion tags (e.g., [laughter]) to the text; the supported tags are listed in cosyvoice/tokenizer/tokenizer.py#L248.
  3. Save the generated speech files.
Instruction-based speech generation
Instruction-based generation supports specifying a dialect or speaking style. For example, generating Sichuanese speech:
```python
# Instruction-based generation: the second argument is a natural-language instruction
for i, j in enumerate(cosyvoice.inference_instruct2(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '用四川话说这句话',
        prompt_speech_16k,
        stream=False)):
    torchaudio.save(f'instruct_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```
- Steps:
  1. Provide the target text and an instruction (e.g., "用四川话说这句话", i.e., "say this sentence in Sichuanese").
  2. Generate speech using the prompt audio.
  3. Save the output files.
Streaming synthesis
Streaming synthesis is suited to real-time applications, with first-packet latency as low as 150 ms. Set stream=True to enable streaming output; apart from that flag, the calls are the same as in the examples above (see the sketch below).
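A minimal streaming sketch, continuing the zero-shot example above (cosyvoice and prompt_speech_16k already defined); it assumes each yielded item carries a 'tts_speech' chunk, as in the non-streaming examples:

```python
import torch
import torchaudio

# With stream=True the generator yields audio chunks as they are synthesized;
# a real-time application would push each chunk to the audio device immediately.
chunks = []
for i, j in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech_16k,
        stream=True)):
    chunks.append(j['tts_speech'])  # here the chunks are simply buffered for saving
torchaudio.save('zero_shot_stream.wav', torch.cat(chunks, dim=1), cosyvoice.sample_rate)
```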
Other considerations
- Model selection: The CosyVoice2-0.5B model is recommended for the best sound quality and performance.
- Environment configuration: Make sure Python 3.10 is used and that the dependencies are installed correctly.
- Troubleshooting: If cloning the submodules fails, rerun git submodule update --init --recursive until it succeeds.
- Documentation reference: For more details, see cosyvoice/tokenizer/tokenizer.py and the official documentation at https://funaudiollm.github.io/cosyvoice2.
Application Scenarios
- Voice assistant development
  CosyVoice's low-latency streaming synthesis and multi-language support are well suited to building intelligent voice assistants. Developers can use zero-shot generation to quickly customize personalized voices and produce natural, fluent spoken responses that improve the interaction experience.
- Multilingual content creation
  Content creators can use cross-lingual synthesis to quickly generate multilingual voiceovers, for example producing English, Chinese, or other-language narration for videos or podcasts while keeping a consistent timbre and reducing production costs.
- Education and language learning
  CosyVoice supports dialect and emotion control and can be used in language-learning applications to generate speech with a specific accent or emotion, helping learners practice listening and pronunciation.
- Film, TV, and game dubbing
  Film, TV, and game developers can use fine-grained emotion control to generate speech with effects such as laughter and pauses, enhancing character expressiveness and the immersion of their work.
FAQ
- What languages does CosyVoice support?
  CosyVoice supports speech synthesis in multiple languages; the supported list can be found in the official documentation or in cosyvoice/tokenizer/tokenizer.py, and covers common languages plus some dialects.
- How can speech-generation latency be reduced?
  Use the streaming synthesis mode (stream=True) to bring first-packet latency down to 150 ms, and make sure to use high-performance hardware and an optimized model (such as CosyVoice2-0.5B).
- Is it necessary to train the model?
  No. CosyVoice provides pre-trained models that can be downloaded and used directly. If customization is needed, refer to the official documentation to train from scratch.
- How do I add a custom voice?
  Use the inference_zero_shot method, passing in a target audio sample and prompt text, to generate speech in the custom timbre. The speaker information can be saved by calling cosyvoice.add_zero_shot_spk (see the sketch below).
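A hedged sketch of saving and reusing a cloned voice, continuing the zero-shot example above (cosyvoice, prompt_speech_16k, and torchaudio already defined). It assumes add_zero_shot_spk takes the prompt text, the prompt audio, and a speaker ID, that save_spkinfo persists the registered speakers, and that inference_zero_shot accepts a zero_shot_spk_id keyword; verify the exact signatures against the repository README:

```python
# Hedged sketch: register the prompt speaker under an ID, persist it, and reuse it later.
cosyvoice.add_zero_shot_spk('希望你以后能够做的比我还好呦。', prompt_speech_16k, 'my_zero_shot_spk')
cosyvoice.save_spkinfo()  # assumed to persist speaker info alongside the model

for i, j in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '', '',  # prompt text and audio are assumed optional when a saved speaker ID is supplied
        zero_shot_spk_id='my_zero_shot_spk', stream=False)):
    torchaudio.save(f'custom_spk_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```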