
How to use CosyVoice for zero-shot speech generation?

2025-08-23


Zero-shot speech generation, which clones a voice from a short reference recording, is one of CosyVoice's key features. The procedure is as follows:

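Preparing the audio sample: A prompt audio file sampled at 16 kHz (e.g. zero_shot_prompt.wav) is required.
Calling the generation function: Use the inference_zero_shot method and pass the target text, the transcript of the prompt audio, and the prompt waveform:

    from cosyvoice.cli.cosyvoice import CosyVoice2
    import torchaudio

    cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
    # torchaudio.load returns (waveform, sample_rate) and does not resample,
    # so the file itself must already be 16 kHz.
    prompt_speech_16k = torchaudio.load('./asset/zero_shot_prompt.wav')[0]
    # Arguments: text to synthesize, transcript of the prompt audio, prompt waveform.
    results = cosyvoice.inference_zero_shot('target text', 'prompt text', prompt_speech_16k)

Saving the output: inference_zero_shot yields one result per generated segment, so iterate over the results and save each one:

    for i, j in enumerate(results):
        torchaudio.save('output_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)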
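
Because the prompt must be sampled at 16 kHz, it can help to check the file before running inference. The following is a minimal sketch using only Python's standard wave module; the helper name check_prompt_wav is hypothetical and not part of CosyVoice, and it assumes the prompt is an uncompressed PCM WAV file.

```python
import wave

REQUIRED_RATE = 16000  # sample rate the prompt audio is expected to have


def check_prompt_wav(path, required_rate=REQUIRED_RATE):
    """Return (ok, actual_rate, duration_seconds) for a PCM WAV file.

    Hypothetical helper for illustration; not part of the CosyVoice API.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / float(rate)
    return rate == required_rate, rate, duration
```

If the check fails, the file would need to be resampled to 16 kHz with an audio tool of your choice before it is used as a prompt.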

Caveats:
- To reproduce the results shown on the official website exactly, pass text_frontend=False when calling the inference method.
- The CosyVoice2-0.5B model is recommended for best results.
- This method clones the target timbre from a short audio sample, with no additional training or fine-tuning on that speaker.
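
The first caveat can be sketched as a small wrapper that passes text_frontend=False and gathers every generated segment. This is only an illustration: collect_zero_shot_segments is a hypothetical helper, model is assumed to be a loaded CosyVoice2 instance, and the stream and text_frontend keyword arguments are assumed to match the installed CosyVoice version.

```python
def collect_zero_shot_segments(model, target_text, prompt_text,
                               prompt_speech_16k, text_frontend=False):
    """Run zero-shot inference and return the generated speech segments.

    Hypothetical helper; `model` is assumed to expose inference_zero_shot
    as a generator yielding dicts with a 'tts_speech' entry.
    """
    segments = []
    results = model.inference_zero_shot(
        target_text,
        prompt_text,          # transcript of the prompt audio
        prompt_speech_16k,    # 16 kHz prompt waveform
        stream=False,         # assumed keyword: generate without streaming
        text_frontend=text_frontend,  # False skips text normalization
    )
    for result in results:
        segments.append(result["tts_speech"])
    return segments
```

Each returned segment could then be written to disk with torchaudio.save at cosyvoice.sample_rate, in the same way as in the saving step.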

This answer comes from the article "CosyVoice: Alibaba's open-source multilingual voice cloning and generation tool".
