CosyVoice is an open-source multilingual speech generation model focused on high-quality text-to-speech (TTS). It supports speech synthesis in multiple languages and provides features such as zero-shot voice generation, cross-lingual voice cloning, and fine-grained emotion control. CosyVoice 2.0 reduces pronunciation errors by 30%–50% compared with the previous version, and sound quality and prosodic naturalness are greatly improved, with the MOS score rising from 5.4 to 5.53. Through a simplified model architecture and optimized algorithms, it achieves low-latency streaming as well as non-streaming speech synthesis for real-time interaction scenarios. The project provides complete inference, training, and deployment support that developers can get started with easily, making it suitable for applications such as voice assistants, dubbing, and multilingual content creation.
Related service: CosyVoice, an open-source 3-second fast voice cloning project launched by Alibaba, with support for emotion control tags.
Feature List
- Zero-shot voice generation: generates speech that matches a target voice from a short audio sample, with no additional training required.
- Cross-lingual speech synthesis: supports multi-language speech generation with consistent timbre, suitable for global content creation.
- Fine-grained emotion control: supports adding laughter, pauses, and other expressive elements to generate more natural speech.
- Dialect and accent adjustment: generates speech in specific dialects (e.g., Sichuanese) or accents to enhance the localized experience.
- Streaming speech synthesis: supports low-latency streaming output with first-packet latency as low as 150 ms, suitable for real-time applications.
- Model training support: provides a complete workflow for both using pre-trained models and training from scratch, to meet developer needs.
- High-quality sound output: optimized sound quality and prosody, with a MOS score of 5.53, close to commercial level.
Usage Guide
Installation process
To use CosyVoice, you first need to install the necessary environment and dependencies. Below are detailed installation steps to ensure that users can quickly configure and run the project.
- Install Conda
  CosyVoice recommends using Conda to manage the environment. Visit https://docs.conda.io/en/latest/miniconda.html to download and install Miniconda, then create and activate a virtual environment:
  ```bash
  conda create -n cosyvoice python=3.10
  conda activate cosyvoice
  ```
- Install dependencies
  Use pip to install the Python packages required by the project. For network reasons, a domestic mirror source (e.g., Aliyun) is recommended:
  ```bash
  pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
  ```
  Install pynini (for text processing):
  ```bash
  conda install -y -c conda-forge pynini==2.1.5
  ```
- Resolve sox compatibility issues
  If you run into sox-related problems, install sox for your operating system:
  - Ubuntu:
    ```bash
    sudo apt-get install sox libsox-dev
    ```
  - CentOS:
    ```bash
    sudo yum install sox sox-devel
    ```
- Download pre-trained models
  CosyVoice provides several pre-trained models (e.g., CosyVoice2-0.5B, CosyVoice-300M, CosyVoice-300M-SFT, and CosyVoice-300M-Instruct). Download these models as described in the GitHub repository and place them in the pretrained_models directory. Also download the CosyVoice-ttsfrd resource to support text front-end processing (see the download sketch after this list).
- Clone the project code
  Clone the CosyVoice repository with Git and make sure the submodules are loaded correctly:
  ```bash
  git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
  cd CosyVoice
  git submodule update --init --recursive
  ```
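For reference, here is a minimal download sketch using ModelScope's snapshot_download. It assumes the models are hosted on ModelScope under the iic namespace; the exact model IDs are an assumption and should be checked against the repository README:

```python
# Minimal sketch: download pre-trained models into pretrained_models/.
# Assumes `pip install modelscope` and that the models are hosted on ModelScope
# under the `iic` namespace; verify the model IDs against the repository README.
from modelscope import snapshot_download

snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```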
Usage
CosyVoice provides a variety of functional modules, including zero-shot voice generation, cross-lingual synthesis, and instruction-based speech generation. Usage for each is described below.
Zero-shot voice generation
Zero-shot generation lets the user produce speech in a target timbre from a short audio sample. For example, to generate speech with a specific timbre:
```python
import sys
sys.path.append('third_party/Matcha-TTS')  # run from the repo root so the Matcha-TTS submodule is importable
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# Load the CosyVoice2-0.5B pre-trained model and a 16 kHz prompt clip carrying the target timbre
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# Synthesize the target text in the prompt speaker's voice and save each output
for i, j in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech_16k,
        stream=False)):
    torchaudio.save(f'zero_shot_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```
- Steps:
  1. Prepare a 16 kHz prompt audio file (e.g., zero_shot_prompt.wav).
  2. Call the inference_zero_shot method, passing in the target text, the prompt text, and the prompt audio.
  3. Save the generated speech files.
- Note: To reproduce the results shown on the official website, set text_frontend=False (see the sketch below).
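Continuing the example above, a minimal sketch of where that flag goes, assuming text_frontend is accepted as a keyword argument by the inference methods (verify against the repository README):

```python
# Hypothetical illustration: pass text_frontend=False to skip the text front-end
# (assumes inference_zero_shot accepts a text_frontend keyword argument).
for i, j in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech_16k,
        stream=False, text_frontend=False)):
    torchaudio.save(f'zero_shot_raw_text_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```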
Cross-lingual speech synthesis
Cross-lingual synthesis supports generating multilingual speech with a consistent timbre. For example, generating speech that includes laughter:
```python
# Fine-grained control: emotion tags such as [laughter] are embedded directly in the text
for i, j in enumerate(cosyvoice.inference_cross_lingual(
        '在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。',
        prompt_speech_16k,
        stream=False)):
    torchaudio.save(f'fine_grained_control_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```
- Steps:
  1. Prepare the prompt audio.
  2. Add emotion tags (e.g., [laughter]) to the text; the supported tags are listed in cosyvoice/tokenizer/tokenizer.py#L248.
  3. Save the generated speech files.
Instruction-based speech generation
Instruction-based generation supports specifying a dialect or speaking style. For example, generating Sichuanese speech:
```python
# Instruction-based generation: the second argument is a natural-language instruction
for i, j in enumerate(cosyvoice.inference_instruct2(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '用四川话说这句话',
        prompt_speech_16k,
        stream=False)):
    torchaudio.save(f'instruct_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```
- Steps:
  1. Provide the target text and an instruction (e.g., "用四川话说这句话", i.e., "say this sentence in Sichuanese").
  2. Generate speech using the prompt audio.
  3. Save the output files.
Streaming synthesis
Streaming synthesis is suited to real-time applications, with first-packet latency as low as 150 ms. Set stream=True to enable streaming output; apart from that flag, the calls are the same as in the examples above (see the sketch below).
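A minimal streaming sketch, continuing the zero-shot example above (cosyvoice and prompt_speech_16k already defined); it assumes each yielded item carries a 'tts_speech' chunk, as in the non-streaming examples:

```python
import torch
import torchaudio

# With stream=True the generator yields audio chunks as they are synthesized;
# a real-time application would push each chunk to the audio device immediately.
chunks = []
for i, j in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '希望你以后能够做的比我还好呦。',
        prompt_speech_16k,
        stream=True)):
    chunks.append(j['tts_speech'])  # here the chunks are simply buffered for saving
torchaudio.save('zero_shot_stream.wav', torch.cat(chunks, dim=1), cosyvoice.sample_rate)
```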
Other considerations
- Model selection: The CosyVoice2-0.5B model is recommended for the best sound quality and performance.
- Environment configuration: Make sure Python 3.10 is used and that the dependencies are installed correctly.
- Troubleshooting: If cloning the submodules fails, rerun git submodule update --init --recursive until it succeeds.
- Documentation reference: For more details, see cosyvoice/tokenizer/tokenizer.py and the official documentation at https://funaudiollm.github.io/cosyvoice2.
Application Scenarios
- Voice assistant development
  CosyVoice's low-latency streaming synthesis and multi-language support are well suited to building intelligent voice assistants. Developers can use zero-shot generation to quickly customize personalized voices and produce natural, fluent spoken responses that improve the interaction experience.
- Multilingual content creation
  Content creators can use cross-lingual synthesis to quickly generate multilingual voiceovers, for example producing English, Chinese, or other-language narration for videos or podcasts while keeping a consistent timbre and reducing production costs.
- Education and language learning
  CosyVoice supports dialect and emotion control and can be used in language-learning applications to generate speech with a specific accent or emotion, helping learners practice listening and pronunciation.
- Film, TV, and game dubbing
  Film, TV, and game developers can use fine-grained emotion control to generate speech with effects such as laughter and pauses, enhancing character expressiveness and the immersion of their work.
FAQ
- What languages does CosyVoice support?
  CosyVoice supports speech synthesis in multiple languages; the supported list can be found in the official documentation or in cosyvoice/tokenizer/tokenizer.py, and covers common languages plus some dialects.
- How can speech-generation latency be reduced?
  Use the streaming synthesis mode (stream=True) to bring first-packet latency down to 150 ms, and make sure to use high-performance hardware and an optimized model (such as CosyVoice2-0.5B).
- Is it necessary to train the model?
  No. CosyVoice provides pre-trained models that can be downloaded and used directly. If customization is needed, refer to the official documentation to train from scratch.
- How do I add a custom voice?
  Use the inference_zero_shot method, passing in a target audio sample and prompt text, to generate speech in the custom timbre. The speaker information can be saved by calling cosyvoice.add_zero_shot_spk (see the sketch below).
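A hedged sketch of saving and reusing a cloned voice, continuing the zero-shot example above (cosyvoice, prompt_speech_16k, and torchaudio already defined). It assumes add_zero_shot_spk takes the prompt text, the prompt audio, and a speaker ID, that save_spkinfo persists the registered speakers, and that inference_zero_shot accepts a zero_shot_spk_id keyword; verify the exact signatures against the repository README:

```python
# Hedged sketch: register the prompt speaker under an ID, persist it, and reuse it later.
cosyvoice.add_zero_shot_spk('希望你以后能够做的比我还好呦。', prompt_speech_16k, 'my_zero_shot_spk')
cosyvoice.save_spkinfo()  # assumed to persist speaker info alongside the model

for i, j in enumerate(cosyvoice.inference_zero_shot(
        '收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
        '', '',  # prompt text and audio are assumed optional when a saved speaker ID is supplied
        zero_shot_spk_id='my_zero_shot_spk', stream=False)):
    torchaudio.save(f'custom_spk_{i}.wav', j['tts_speech'], cosyvoice.sample_rate)
```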