SongGen integrates voiceprint encoding to extract a speaker's tonal characteristics from just 3 seconds of reference audio. The technical implementation consists of two key components:
- Voiceprint extraction: extracting speaker embedding vectors with an ECAPA-TDNN model
- Feature fusion: aligning the acoustic features with the musical content representations in a shared latent space
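The paper does not publish this part of the pipeline, so the sketch below is only a schematic illustration of the two steps with toy numpy stand-ins: a placeholder "encoder" plays the role of ECAPA-TDNN, and fusion is done by projecting the speaker embedding into the music latent space and adding it (one simple fusion scheme; the real model may differ). All function names, dimensions, and the fusion rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_speaker_embedding(waveform: np.ndarray, dim: int = 192) -> np.ndarray:
    """Toy stand-in for an ECAPA-TDNN encoder: frame the waveform,
    mean-pool per frame, and project to a fixed-size speaker embedding."""
    frames = waveform.reshape(-1, 160)            # 10 ms frames at 16 kHz (toy framing)
    feats = frames.mean(axis=1)                   # one scalar "feature" per frame
    proj = rng.standard_normal((feats.shape[0], dim))
    emb = feats @ proj
    return emb / np.linalg.norm(emb)              # L2-normalize, as speaker embeddings usually are

def fuse(speaker_emb: np.ndarray, music_latent: np.ndarray) -> np.ndarray:
    """Align the speaker embedding with the music latent by projecting it
    into the latent dimension and adding it to every latent frame."""
    W = rng.standard_normal((speaker_emb.shape[0], music_latent.shape[-1])) * 0.01
    return music_latent + speaker_emb @ W         # broadcast over latent frames

ref = rng.standard_normal(16000 * 3)              # 3 s of 16 kHz reference audio
emb = extract_speaker_embedding(ref)              # shape (192,)
latent = rng.standard_normal((50, 512))           # 50 music-content latent frames
fused = fuse(emb, latent)                         # shape (50, 512)
print(emb.shape, fused.shape)
```

In a real system the encoder would be a pretrained speaker-verification network and the fusion would be learned jointly with the generator; the point here is only the shape of the data flow.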
In practice, the user can choose whether to separate the vocal track from the reference audio. When the `separate` parameter is set to `True`, the system first performs source separation, ensuring the purity of the cloned vocal features.
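The `separate` toggle described above can be sketched as a simple preprocessing branch. The function names and the no-op separator below are hypothetical; a real pipeline would call an actual source-separation model (e.g. a Demucs-style network) in place of the placeholder.

```python
import numpy as np

def separate_vocals(mix: np.ndarray) -> np.ndarray:
    """Placeholder for a real source-separation model: in a real pipeline
    this would return only the vocal stem. Here it passes audio through."""
    return mix

def prepare_reference(audio: np.ndarray, separate: bool = True) -> np.ndarray:
    """Mirror the `separate` parameter: optionally run source separation
    on the reference audio before voiceprint extraction."""
    return separate_vocals(audio) if separate else audio

ref = np.zeros(16000 * 3)                 # 3 s of silence as a dummy reference
clean = prepare_reference(ref, separate=True)
print(clean.shape)                        # (48000,)
```

The branch is trivial by design: the interesting work happens inside the separation model, and keeping the toggle at the preprocessing boundary lets users skip the (relatively costly) separation step when their reference audio is already a clean vocal.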
This technology allows users to have the generated song sung in their preferred voice, greatly enhancing the personalization of their creations.
This answer comes from the article "SongGen: A Single-Stage Autoregressive Transformer for Automatic Song Generation".