The emotional control of Orpheus-TTS is realized through a three-layer technical architecture:
- Tag parsing layer: The system has a built-in XML-style tag parser that recognizes special emotion tags and maps them to 32-dimensional emotion embedding vectors.
- Model architecture layer: The model modifies the decoder-only structure based on Llama-3b by adding emotion-weight gating to the attention mechanism, which lets tags dynamically adjust the fundamental frequency (F0) and energy of the speech.
- Acoustic modeling layer: A modified HiFi-GAN vocoder is used; its conditional adversarial training takes the emotion vectors as prior conditions and generates waveforms that carry the corresponding paralinguistic features.
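The first two layers can be illustrated with a minimal sketch. It is not the project's actual code: the tag names (`laugh`, `sigh`), the `EMOTION_TABLE` lookup, the `parse_tags` and `emotion_gate` helpers, and the sigmoid form of the gate are all assumptions made for illustration, and random vectors stand in for learned embeddings.

```python
import math
import random
import re

EMBED_DIM = 32  # embedding size stated in the text

# Hypothetical tag names with random stand-in vectors; a real system
# would load learned embeddings instead.
_rng = random.Random(0)
EMOTION_TABLE = {
    name: [_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for name in ("laugh", "sigh")
}

TAG_RE = re.compile(r"<([a-z]+)>")


def parse_tags(text):
    """Strip known tags from text.

    Returns (plain_text, events), where each event is
    (position_in_plain_text, tag_name, embedding).
    """
    events = []
    plain_parts = []
    cursor = 0      # read position in the source string
    plain_len = 0   # length of the plain text emitted so far
    for match in TAG_RE.finditer(text):
        name = match.group(1)
        if name not in EMOTION_TABLE:
            continue  # unknown tags are left in the text untouched
        chunk = text[cursor:match.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        events.append((plain_len, name, EMOTION_TABLE[name]))
        cursor = match.end()
    plain_parts.append(text[cursor:])
    return "".join(plain_parts), events


def emotion_gate(embedding, gate_weights, bias=0.0):
    """Scalar sigmoid gate in (0, 1) derived from an emotion vector.

    One assumed form of "emotion-weight gating"; the source does not
    specify the exact formulation.
    """
    z = sum(w * e for w, e in zip(gate_weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def apply_gate(attn_output, gate):
    """Scale an attention output vector by the emotion gate."""
    return [gate * x for x in attn_output]


plain, events = parse_tags("That is funny <laugh> really")
gate = emotion_gate(events[0][2], [0.1] * EMBED_DIM)
scaled = apply_gate([1.0, -2.0, 0.5], gate)
```

In this sketch the parser keeps the character position of each tag, so a downstream model could align the emotion vector with the surrounding tokens; the gate then scales attention outputs by a value between 0 and 1.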
Compared with ordinary TTS systems, the innovations are 1) integrating non-verbal feature processing into the end-to-end pipeline and 2) discovering the acoustic features of common emotional patterns (e.g., the harmonic distortion patterns of laughter) through unsupervised clustering. Practical tests show that adding a laughter tag to the same text improves the jitter (cycle-to-cycle variation of the pitch period) of the generated speech by 37%, bringing it closer to the characteristics of real laughter.
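For readers unfamiliar with the metric, the following sketch shows one common way to compute jitter and a relative change between two utterances. It assumes the "local jitter" definition (mean absolute difference between consecutive pitch periods, divided by the mean period); the source does not say which jitter variant or measurement tool was used, and the function names are hypothetical.

```python
def local_jitter(periods):
    """Local jitter of a sequence of pitch periods (in seconds).

    Mean absolute difference between consecutive periods,
    divided by the mean period.
    """
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_diff / mean_period


def relative_change(baseline, value):
    """Relative change of value with respect to baseline (0.37 means +37%)."""
    return (value - baseline) / baseline
```

Given pitch periods extracted from an untagged and a tagged rendering of the same text, `relative_change(local_jitter(untagged), local_jitter(tagged))` would yield the kind of percentage figure quoted above.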
This answer comes from the article "Orpheus-TTS: Text-to-Speech Tool for Generating Natural Chinese Speech".
































