MOSS-TTSD is an open-source dialogue speech generation model that supports both Chinese and English. It converts two-speaker conversational text into natural, expressive speech, making it suitable for AI podcast production, language research, and similar applications. The model is built on low-bitrate speech coding and supports zero-shot two-speaker voice cloning as well as single-pass speech generation of up to 960 seconds. MOSS-TTSD ships with complete model weights and inference code and is free for commercial use. The latest version, currently v0.5, is optimized for timbre switching and model stability and is available on GitHub.
Feature List
- Supports bilingual (Chinese and English) dialogue speech generation with natural, expressive output.
- Performs zero-shot two-speaker voice cloning and accurately distinguishes the speakers in a conversation.
- Generates a single continuous speech segment of up to 960 seconds, suitable for podcasts and other long-form content.
- Offers Podever, a podcast generation tool that turns PDFs, URLs, or long text into high-quality podcasts.
- Open-sources the model weights, inference code, and API interfaces, with free commercial use.
- Provides fine-tuning scripts supporting both full-model and LoRA fine-tuning on custom datasets.
How to Use
Installation
MOSS-TTSD must be installed in an environment with Python available. The detailed installation steps are as follows:
- Create a virtual environment
Create a separate Python environment with conda so the installation does not interfere with other projects; Python 3.10 is recommended. Run:
```bash
conda create -n moss_ttsd python=3.10 -y
conda activate moss_ttsd
```
- Clone the codebase
Download the MOSS-TTSD codebase from GitHub. Open a terminal and run:
```bash
git clone https://github.com/OpenMOSS/MOSS-TTSD.git
cd MOSS-TTSD
```
- Install dependencies
The codebase contains a `requirements.txt` file listing the required dependencies. Install them:
```bash
pip install -r requirements.txt
pip install flash-attn
```
Note: `flash-attn` is a library that accelerates the attention mechanism; make sure your GPU environment supports it.
- Download model weights
The MOSS-TTSD model weights can be downloaded from Hugging Face or the GitHub Releases page; v0.5 is the recommended version. Place the downloaded weights in the project root directory or another path of your choosing.
- Verify the installation
Run the sample script to check that the environment is configured correctly:
```bash
python demo.py
```
If it succeeds, a short dialogue speech file is generated.
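Before running the demo, it can help to confirm that Python sees the GPU and the optional `flash-attn` package. A minimal sketch using only the dependencies installed above (nothing here is MOSS-TTSD-specific):
```python
import importlib.util

import torch

# Check that PyTorch can see a CUDA-capable GPU (MOSS-TTSD recommends >= 12 GB of memory).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; inference will be very slow on CPU.")

# flash-attn is importable only if its wheel built correctly for your GPU.
if importlib.util.find_spec("flash_attn") is not None:
    print("flash-attn is installed.")
else:
    print("flash-attn not found; attention will fall back to a slower path.")
```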
Main Functions
1. Dialogue speech generation
The core function of MOSS-TTSD is converting conversational text into speech. Prepare a text file containing a two-person dialogue in the following format:
```
Speaker1: 你好，今天天气怎么样？
Speaker2: 很好，阳光明媚！
```
(In English: "Hello, how's the weather today?" / "Great, the sun is shining!")
Run the inference script to generate speech:
```bash
python inference.py --model_path <path_to_model> --input_text <path_to_text_file> --output_dir <output_directory>
```
The output is a WAV file in which the two speakers' voices are automatically distinguished.
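The same run can also be driven from Python, for example to generate dialogues in a batch. A minimal sketch that shells out to `inference.py` with the flags shown above; the model path is a placeholder for your downloaded weights:
```python
import subprocess
from pathlib import Path

# Write a two-speaker dialogue in the expected "SpeakerN:" format.
dialogue = "\n".join([
    "Speaker1: 你好，今天天气怎么样？",
    "Speaker2: 很好，阳光明媚！",
])
Path("dialogue.txt").write_text(dialogue, encoding="utf-8")

out_dir = Path("output")
out_dir.mkdir(exist_ok=True)

# Invoke the repository's inference script with the flags documented above.
subprocess.run(
    [
        "python", "inference.py",
        "--model_path", "path/to/model",  # placeholder: your downloaded weights
        "--input_text", "dialogue.txt",
        "--output_dir", str(out_dir),
    ],
    check=True,
)
```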
2. Voice cloning
MOSS-TTSD supports zero-shot voice cloning: given an audio clip (at least 10 seconds) of each target speaker, the model generates dialogue speech in that speaker's timbre. Steps:
- Prepare the target audio files (e.g. `speaker1.wav` and `speaker2.wav`).
- Modify the configuration file `config.yaml` to specify the audio paths:
```yaml
speaker1: path/to/speaker1.wav
speaker2: path/to/speaker2.wav
```
- Run the cloning script:
```bash
python clone_voice.py --config config.yaml --input_text dialogue.txt --output_dir cloned_output
```
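Because cloning quality depends on the reference clips, it is worth checking their length before running the script. A sketch, assuming the `soundfile` and `PyYAML` packages are installed; the 10-second threshold comes from the requirement above, and the file paths are the example ones:
```python
import soundfile as sf
import yaml

references = {
    "speaker1": "path/to/speaker1.wav",
    "speaker2": "path/to/speaker2.wav",
}

for name, path in references.items():
    info = sf.info(path)
    duration = info.frames / info.samplerate
    # Zero-shot cloning needs at least ~10 s of clean reference audio per speaker.
    if duration < 10.0:
        raise ValueError(f"{name}: only {duration:.1f}s of audio; need >= 10s")
    print(f"{name}: {duration:.1f}s at {info.samplerate} Hz")

# Write the config consumed by clone_voice.py (format shown above).
with open("config.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(references, f)
```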
3. Podcast generation (Podever)
Podever is MOSS-TTSD's podcast generation tool; it turns long text, PDFs, or URLs into podcasts. Steps:
- Install the Podever extension:
```bash
pip install podever
```
- Prepare the input file (e.g. a PDF or a URL).
- Run the command:
```bash
python podever.py --input <input_file_or_url> --output podcast.wav
```
Podever automatically extracts the text and generates a two-person conversational podcast, well suited to popular-science content or reading books aloud.
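To produce several episodes in one go, the command can be looped from Python. A sketch, assuming `podever.py` accepts the flags shown above; the input list is purely illustrative:
```python
import subprocess
from pathlib import Path

# Illustrative inputs: podever.py accepts files or URLs per the command above.
inputs = ["article.pdf", "https://example.com/post"]

out_dir = Path("podcasts")
out_dir.mkdir(exist_ok=True)

for i, source in enumerate(inputs, start=1):
    episode = out_dir / f"episode_{i:02d}.wav"
    # One invocation per episode; check=True stops the batch on failure.
    subprocess.run(
        ["python", "podever.py", "--input", source, "--output", str(episode)],
        check=True,
    )
    print(f"Generated {episode} from {source}")
```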
4. Model fine-tuning
Users can fine-tune the model on a custom dataset. The steps are as follows:
- Prepare the dataset in JSON format, containing the dialogue text and the corresponding audio (see the sketch after this list).
- Run the fine-tuning script:
```bash
python finetune/finetune.py --model_path <path_to_model> --data_dir <path_to_processed_data> --output_dir <output_directory> --training_config <training_config_file>
```
- LoRA fine-tuning is supported to reduce compute resource requirements:
```bash
python finetune/finetune.py --model_path <path_to_model> --data_dir <path_to_processed_data> --output_dir <output_directory> --training_config <training_config_file> --lora_config <lora_config_file>
```
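The exact dataset schema is not spelled out here, so the field names and JSON Lines layout below are illustrative assumptions rather than the official format; consult the repository's fine-tuning documentation for the real schema. A sketch of writing one text/audio pair per line:
```python
import json

# Hypothetical field names; check the repo's finetune docs for the real schema.
examples = [
    {
        "text": "Speaker1: 你好，今天天气怎么样？\nSpeaker2: 很好，阳光明媚！",
        "audio": "data/audio/dialogue_0001.wav",  # path to the matching recording
    },
]

# One JSON object per line (JSON Lines); ensure_ascii=False keeps Chinese readable.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```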
Caveats
- Ensure the input audio has a DNSMOS score of at least 2.8 to guarantee sound quality.
- The model may be insensitive to short conversational backchannels (e.g. "um", "oh"); label the speakers clearly in the text.
- Running the model requires at least 12 GB of GPU memory; NVIDIA GPUs are recommended.
Application Scenarios
- AI podcast production
MOSS-TTSD turns articles, books, or web content into two-person conversational podcasts. Users only need to provide the text, and the Podever tool generates natural, fluent audio, letting independent creators produce content quickly.
- Language learning tools
Teachers can use MOSS-TTSD to generate bilingual conversation audio to help students practice listening and speaking. Voice cloning can reproduce a real person's timbre to make learning more engaging.
- Accessibility assistance
MOSS-TTSD can generate audiobooks or conversational news broadcasts for visually impaired users. Long-form generation outputs complete chapters in a single pass, reducing how often the user has to intervene.
- Academic research
Researchers can take advantage of MOSS-TTSD's open-source nature to explore speech synthesis techniques. The model supports fine-tuning and is suitable for developing customized speech applications.
FAQ
- What languages does MOSS-TTSD support?
It currently supports dialogue generation in Chinese and English; support for more languages may be added in the future.
- How can the quality of generated speech be improved?
Use high-quality input audio (DNSMOS ≥ 2.8) and make sure the dialogue text clearly labels each speaker. Fine-tuning the model can further improve results.
- Can it be used commercially?
Yes. MOSS-TTSD is released under the Apache 2.0 license and free commercial use is supported, subject to legal and ethical compliance.
- What hardware does the model require?
An NVIDIA GPU with at least 12 GB of video memory is recommended; CPU inference is slow and not recommended for production environments.