Zero-shot speech generation is one of CosyVoice's key features. The procedure is as follows:
- Preparing the audio sample: A 16 kHz prompt audio file (e.g. zero_shot_prompt.wav) is required.
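- Calling the generator function: Use the inference_zero_shot method and pass the target text, the prompt text, and the prompt audio:
from cosyvoice.cli.cosyvoice import CosyVoice2
import torchaudio
cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
# torchaudio.load returns (waveform, sample_rate); keep only the waveform
prompt_speech_16k = torchaudio.load('./asset/zero_shot_prompt.wav')[0]
# inference_zero_shot yields result dicts, so the output is consumed in a loop
results = cosyvoice.inference_zero_shot('target text', 'prompt text', prompt_speech_16k)
- Saving the output:
for i, j in enumerate(results):
    torchaudio.save('output_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)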
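Since the first step requires a 16 kHz prompt file, it can help to check the file before passing it to the model. The following is a minimal sketch using only Python's standard wave module. The helper name check_prompt_wav is hypothetical and not part of CosyVoice, and it assumes the prompt is an uncompressed PCM WAV file.

```python
import wave


def check_prompt_wav(path, expected_rate=16000):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file.

    Hypothetical helper, not part of CosyVoice. Raises ValueError when the
    sample rate differs from expected_rate.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    if rate != expected_rate:
        raise ValueError(
            f"{path}: sample rate is {rate} Hz, expected {expected_rate} Hz"
        )
    return rate, channels, duration
```

If the check fails, the file would need to be resampled to 16 kHz before use.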
Caveats:
- To fully reproduce the results shown on the official website, set the text_frontend=False parameter.
- The CosyVoice2-0.5B model is recommended for best results.
- This method generates speech in the target timbre from a short audio sample, without any additional training on that voice.
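The text_frontend caveat can be sketched as a small wrapper. This assumes text_frontend is accepted as a keyword argument of inference_zero_shot, which the caveat implies but does not state outright. The function name collect_zero_shot is hypothetical, and model stands for an already loaded CosyVoice2 instance.

```python
def collect_zero_shot(model, text, prompt_text, prompt_speech_16k,
                      text_frontend=False):
    """Run zero-shot inference and return the generated audio chunks.

    Hypothetical wrapper: assumes model.inference_zero_shot accepts a
    text_frontend keyword and yields dicts with a 'tts_speech' entry.
    """
    chunks = []
    for result in model.inference_zero_shot(
        text, prompt_text, prompt_speech_16k, text_frontend=text_frontend
    ):
        chunks.append(result["tts_speech"])
    return chunks
```

Each returned chunk can then be written out with torchaudio.save at cosyvoice.sample_rate, as in the saving step.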
This answer comes from the article "CosyVoice: Ali open source multilingual cloning and generation tools".