
How to achieve high quality voice cloning using only 15 seconds of audio?

2025-09-10


Complete process for short duration audio cloning

Short-duration audio cloning in Llasa-3B rests on three steps:

xcodec2 feature extraction: Encode the 15 seconds of reference audio into a sequence of 384-dimensional vectors (the audio must be sampled at 16 kHz)
Prefix-guided generation: Convert the extracted features into formatted prefix tokens (<|s_[id]|>) and insert them into the generation prompt
End-to-end conversion: The model picks up the speaker's vocal characteristics from this prefix and keeps the timbre consistent in the generated speech
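The prefix step can be illustrated with a minimal sketch that only covers the string formatting. It assumes the codec has already produced a list of integer ids for the reference audio; the helper names and the id values are hypothetical, and only the <|s_[id]|> token format comes from the description above.

```python
import re

# Matches tokens of the form <|s_123|> and captures the integer id.
_TOKEN_RE = re.compile(r"<\|s_(\d+)\|>")


def ids_to_speech_tokens(speech_ids):
    """Format integer codec ids as <|s_ID|> token strings."""
    return [f"<|s_{int(i)}|>" for i in speech_ids]


def speech_tokens_to_ids(text):
    """Recover the integer ids from a string containing <|s_ID|> tokens."""
    return [int(m) for m in _TOKEN_RE.findall(text)]


# Placeholder ids for illustration only, not real codec output.
reference_ids = [12, 345, 6789]

# The joined string is what would be placed in the prompt as the speech prefix.
prefix = "".join(ids_to_speech_tokens(reference_ids))
print(prefix)
```

The reverse helper is useful after generation, when the ids in the model's output have to be pulled back out before they can be decoded to audio.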

Key considerations: 1) The reference audio must be clear and free of background noise; 2) Use .unsqueeze(0).unsqueeze(0) to keep the input dimensions correct; 3) The cloning result can be tuned through sampling parameters such as top_p=1.
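As a small illustration of the second point, the sketch below shows what two consecutive unsqueeze(0) calls do to a one-dimensional tensor of ids in PyTorch. The id values are placeholders, and the assumption that the codec expects a (batch, channel, length) layout is ours, not stated in the source.

```python
import torch

# Placeholder ids for illustration only, not real codec output.
speech_ids = [12, 345, 6789]

codes = torch.tensor(speech_ids)   # shape: (3,)
codes = codes.unsqueeze(0)         # shape: (1, 3), adds a leading dimension
codes = codes.unsqueeze(0)         # shape: (1, 1, 3), adds another leading dimension

print(tuple(codes.shape))
```

Each call inserts a new dimension of size 1 at position 0, so a flat sequence of length T ends up with shape (1, 1, T).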

This answer comes from the article "Llasa 1~8B: an open source text-to-speech model for high quality speech generation and cloning".
