Background
In multilingual speech synthesis, traditional models often struggle to keep the same timbre consistent across languages, which results in a fragmented listening experience. CosyVoice addresses this pain point with cross-lingual voice cloning.
Core Solutions
- Use zero-shot generation: with the inference_zero_shot method, the model preserves timbre characteristics across languages from only 3 seconds of reference audio.

    import torchaudio
    from cosyvoice.cli.cosyvoice import CosyVoice2

    cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
    # Reference clip; torchaudio.load returns (waveform, sample_rate)
    prompt_audio = torchaudio.load('prompt.wav')[0]
    # text: content to synthesize; prompt_text: transcript of prompt.wav
    cosyvoice.inference_zero_shot(text, prompt_text, prompt_audio)

- Pre-trained model support: use the officially provided CosyVoice2-0.5B model directly; it has been jointly trained on a multilingual corpus.
- Timbre persistence: call the add_zero_shot_spk method to save the timbre signature, so that later calls do not need to reload the audio.
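The save-once, reuse-everywhere flow described in the last bullet could look roughly like the sketch below. The helper name synthesize_with_saved_timbre is hypothetical, and the call shapes for add_zero_shot_spk and inference_zero_shot (including the zero_shot_spk_id and stream keywords and the "tts_speech" key) are assumptions based on the upstream usage examples; check them against the CosyVoice version you have installed.

```python
def synthesize_with_saved_timbre(model, prompt_text, prompt_audio, texts,
                                 spk_id="my_zero_shot_spk"):
    """Register a timbre once, then reuse it for texts in several languages."""
    # Assumed signature: add_zero_shot_spk(prompt_text, prompt_audio_16k, spk_id) -> bool
    if not model.add_zero_shot_spk(prompt_text, prompt_audio, spk_id):
        raise RuntimeError("could not register speaker %r" % spk_id)

    results = {}
    for text in texts:
        # With a saved speaker id, prompt text and prompt audio are left empty.
        chunks = model.inference_zero_shot(
            text, "", "", zero_shot_spk_id=spk_id, stream=False
        )
        # Assumption: each yielded item is a dict holding the waveform
        # under the "tts_speech" key.
        results[text] = [chunk["tts_speech"] for chunk in chunks]
    return results
```

Here model would be a loaded CosyVoice2 instance and prompt_audio the 16 kHz reference clip; the returned dictionary maps each input text to its list of generated waveform chunks, all produced from the same saved timbre.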
Caveats
Make sure the reference audio has a 16 kHz sample rate. A clear, dry recording with ambient noise below -60 dB is recommended. For professional use, first check the fundamental frequency characteristics of the audio with a tool such as Praat.
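A quick pre-flight check of the reference clip can catch the two most common problems before synthesis. The sketch below uses only the Python standard library; the function name check_reference_wav is hypothetical, it assumes a 16-bit PCM WAV file, and it interprets the -60 dB figure as dBFS measured on the quietest 10% of 20 ms frames, which is one possible reading and not a definition given by CosyVoice.

```python
import math
import struct
import wave


def check_reference_wav(path, expected_rate=16000, frame_ms=20):
    """Return (sample_rate, rate_ok, noise_floor_dbfs) for a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())

    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    samples = samples[::channels]  # keep the first channel only

    frame_len = max(1, rate * frame_ms // 1000)
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        # Clamp to one quantization step so silent frames do not hit log10(0).
        levels.append(20 * math.log10(max(rms, 1.0) / 32768.0))
    if not levels:
        raise ValueError("audio is shorter than one frame")

    levels.sort()
    # Treat the quietest 10% of frames as pauses; their mean level is a
    # rough estimate of the ambient noise floor.
    quiet = levels[:max(1, len(levels) // 10)]
    return rate, rate == expected_rate, sum(quiet) / len(quiet)
```

A call such as check_reference_wav('prompt.wav') returns the detected sample rate, whether it matches 16 kHz, and the estimated noise floor. This is only a coarse screen; pitch analysis still calls for a dedicated tool such as Praat.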
This answer comes from the article "CosyVoice: Ali open source multilingual cloning and generation tools".