VibeVoice-1.5B is a cutting-edge open-source Text-to-Speech (TTS) model released by Microsoft Research. It is designed for generating expressive, long-form, multi-speaker dialogue audio such as podcasts and audiobooks. The core innovation of VibeVoice is its use of continuous speech tokenizers (acoustic and semantic) running at an ultra-low frame rate of 7.5 Hz, which greatly improves the computational efficiency of processing long sequences while effectively preserving audio fidelity. The model builds on a large language model (LLM) to understand textual context and dialogue flow, and combines it with a diffusion model to generate high-fidelity acoustic details. VibeVoice can synthesize up to 90 minutes of audio in a single pass and supports up to four distinct speakers in one audio segment, breaking the one- or two-speaker limitation of many previous models. The model is trained primarily on English and Chinese data, and supports both cross-lingual synthesis and basic singing synthesis.
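The practical effect of the 7.5 Hz frame rate is easy to see with a little arithmetic: a 90-minute recording maps to only about 40,000 latent frames for the LLM to process. Below is a minimal sketch of that calculation; the 50 Hz comparison rate is an illustrative assumption, not a figure from the VibeVoice report.

```python
# Frame-count comparison for a 90-minute recording.
# 7.5 Hz is the VibeVoice tokenizer frame rate stated above; 50 Hz is an
# assumed rate for a typical neural audio codec, used only for illustration.
minutes = 90
seconds = minutes * 60

vibevoice_frames = seconds * 7.5      # ~40,500 latent frames
typical_codec_frames = seconds * 50   # ~270,000 frames at the assumed rate

print(f"VibeVoice (7.5 Hz): {vibevoice_frames:,.0f} frames")
print(f"Typical codec (50 Hz, assumed): {typical_codec_frames:,.0f} frames")
print(f"Reduction factor: {typical_codec_frames / vibevoice_frames:.1f}x")
```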
Function List
- Ultra-long audio synthesis: Supports the generation of up to 90 minutes of coherent speech audio in a single task.
- Multi-speaker support: Can simulate natural conversations with up to 4 different speakers in the same audio.
- Expressive voices: The generated speech is more natural in emotion and delivery, avoiding the mechanical feel of traditional TTS models.
- Cross-lingual and singing synthesis: Although the main training data are Chinese and English, the model has some cross-lingual synthesis capability (e.g., an English prompt producing Chinese speech) and basic singing ability.
- Open source and accessible: The model is released under the MIT license, is friendly to the research community, and comes with a code repository and technical report for developers to use.
- High-efficiency architecture: Handles long-sequence audio generation efficiently using innovative acoustic and semantic tokenizers running at a very low frame rate.
- Safety measures: To prevent misuse, the model automatically embeds an audible "AI-generated" disclaimer and an imperceptible watermark in the generated audio.
Usage Guide
VibeVoice-1.5B is aimed mainly at researchers and developers; general users can try it through the Gradio demo on Hugging Face. Developers can follow the steps below to deploy and use the model in a local environment.
Environment Preparation and Installation
First, make sure Python and PyTorch are installed in your environment. Since the model requires substantial compute, it is recommended to run it on Linux or Windows (via WSL2) with an NVIDIA GPU (at least 10 GB of VRAM recommended); a quick GPU check is sketched after the installation steps below.
- Clone the code repository:
  Clone the VibeVoice code repository from GitHub.

```bash
git clone https://github.com/microsoft/VibeVoice-Code.git
cd VibeVoice-Code
```
- Install dependencies:
  The codebase provides a `requirements.txt` file that lists all the required Python dependencies.

```bash
pip install -r requirements.txt
```
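Before downloading the model, it can help to confirm that PyTorch actually sees the GPU and that it meets the 10 GB VRAM recommendation mentioned above. A minimal sanity check (the 10 GB threshold simply mirrors that recommendation; adjust as needed):

```python
import torch

# Quick check of the GPU recommendation from the environment section.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 10:
        print("Warning: below the recommended 10 GB of VRAM; long generations may run out of memory.")
else:
    print("No CUDA GPU detected; inference will fall back to CPU and be very slow.")
```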
Model Download
The VibeVoice-1.5B model files are hosted on Hugging Face. Specify the model path `microsoft/VibeVoice-1.5B` in your code, and the Hugging Face `transformers` library will automatically download the required model files.
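If you prefer to fetch the weights ahead of time (for example, on a machine with a better connection, or to avoid the download on first run), the `huggingface_hub` client can pre-download them into the local cache. A minimal sketch:

```python
from huggingface_hub import snapshot_download

# Pre-download the VibeVoice-1.5B weights into the local Hugging Face cache.
# Later from_pretrained("microsoft/VibeVoice-1.5B") calls will reuse these files.
local_path = snapshot_download(repo_id="microsoft/VibeVoice-1.5B")
print(f"Model files cached at: {local_path}")
```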
How to Use (Code Example)
The core functionality of VibeVoice is text-to-speech conversion, invoked from a script that calls the model. Below is a basic workflow and code snippet demonstrating how to generate audio for a multi-speaker conversation.
- Prepare the input text:
  VibeVoice uses a simple format to distinguish between speakers: label each line of the script with the speaker's identity, for example `[speaker 0]` or `[speaker 1]` (a small helper for assembling this format programmatically is sketched after the inference example below).

```python
text = """
[speaker 0] Hello, and welcome to our AI podcast. Today we're talking about the latest speech synthesis technology.
[speaker 1] Right, and models like VibeVoice can generate up to 90 minutes of dialogue, which is amazing.
[speaker 0] Yes, and it supports up to 4 different voices, so we can produce more complex radio dramas and multi-voice audiobooks.
[speaker 1] Let's hear how it sounds!
"""
```
- Write the inference script:
  Load the model and the processor (tokenizer), then feed the prepared text into the model to generate audio.

```python
import torch
from transformers import AutoProcessor, AutoModelForTextToWaveform
import scipy.io.wavfile

# Select the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor
processor = AutoProcessor.from_pretrained("microsoft/VibeVoice-1.5B")
model = AutoModelForTextToWaveform.from_pretrained("microsoft/VibeVoice-1.5B").to(device)

# Prepare the inputs
inputs = processor(text=text, return_tensors="pt").to(device)

# Generate the speech waveform
with torch.no_grad():
    waveform = model.generate(**inputs, do_sample=True, temperature=0.9)

# Save the audio file
# Note: the sampling rate should be read from the model config (e.g. 24000)
sampling_rate = model.config.sampling_rate
scipy.io.wavfile.write("output_dialogue.wav", rate=sampling_rate, data=waveform[0].cpu().numpy())

print("Audio file generated: output_dialogue.wav")
```
This script generates an audio file named `output_dialogue.wav` containing a conversation between two speakers.
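For longer scripts it is usually easier to assemble the speaker-tagged text programmatically than to write it by hand. A small sketch of a helper that produces the `[speaker N]` format used above (the helper name is purely illustrative):

```python
def build_script(turns):
    """Build a VibeVoice-style multi-speaker script from (speaker_id, line) pairs."""
    return "\n".join(f"[speaker {speaker_id}] {line}" for speaker_id, line in turns)

# Example: a short two-speaker exchange.
turns = [
    (0, "Hello, and welcome to our AI podcast."),
    (1, "Today we're talking about long-form speech synthesis."),
    (0, "Let's get started!"),
]
text = build_script(turns)
print(text)
```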
Featured Capability: One-Shot Voice Cloning
In several community-provided demos, VibeVoice demonstrates powerful one-shot voice cloning: the user provides a short audio sample of the target voice, and the model mimics its timbre to read new text aloud.
In the Gradio demo interface, there is usually an area for uploading audio files (a programmatic sketch of the same flow follows the steps below).
- Upload a clear audio file free of background noise (e.g. WAV or MP3) containing the voice you want to clone.
- In the text input box, type the text you want the model to read aloud with this voice.
- Click the "Generate" button; the model will synthesize new speech using the timbre of the uploaded audio.
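When a community demo exposes a public Gradio endpoint, the same voice-cloning flow can also be driven from a script with `gradio_client`. The Space URL, argument order, and endpoint name below are hypothetical placeholders; check the demo's "Use via API" page for the actual signature:

```python
from gradio_client import Client, handle_file

# Hypothetical example: the demo URL and inputs depend entirely on the specific
# community Space; consult its API documentation for the real values.
client = Client("https://example-vibevoice-demo.hf.space")  # placeholder URL

result = client.predict(
    handle_file("reference_voice.wav"),          # audio sample of the voice to clone
    "Read this sentence in the cloned voice.",   # text to synthesize
    api_name="/generate",                        # placeholder endpoint name
)
print("Generated audio saved at:", result)
```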
Caveats
- Research use only: Microsoft emphasizes that the model is currently intended for research use only and is not recommended for commercial or production environments.
- Language restrictions: The model is optimized primarily for English and Chinese, and may produce unpredictable or poor-quality output in other languages.
- No background audio: The model generates clean speech only; no background music or ambient noise is added.
- No overlapping speech: The current version does not simulate interruptions or overlapping speech common in multi-person conversations; transitions between speakers are strictly sequential.
Application Scenarios
- Podcast and audiobook production
Leveraging VibeVoice's ability to generate up to 90 minutes of audio with up to four characters, content creators can efficiently turn scripts or books into multi-speaker audio content, dramatically reducing recording costs.
- Game character voiceover
Game developers can use the model to generate large amounts of dialogue for non-player characters (NPCs). Its expressiveness makes character voices sound more natural and enhances game immersion.
- Content accessibility
Convert long articles, news, or reports into natural speech for visually impaired users. The multi-speaker feature can distinguish quotations from commentary, making content easier to follow.
- Language learning
The model can be used to create learning materials that simulate real conversation scenarios. By varying character voices, it helps learners adapt to different accents and speaking speeds.
QA
- What languages does VibeVoice-1.5B support?
The model is trained and optimized mainly on English and Chinese data. Although it has some cross-lingual synthesis capability, results in other languages may be unstable or unsatisfactory.
- Are there hardware requirements for using VibeVoice-1.5B?
Yes. For reasonable inference speed, it is recommended to run on a device with an NVIDIA GPU with at least 10 GB of VRAM. Running in a CPU-only environment can be very slow.
- Can the generated audio be used in commercial projects?
No. According to the official documentation, the released model is limited to research purposes and is not recommended for any commercial application. All use is subject to the restrictions in the MIT license and model card, such as the prohibition on voice impersonation and disinformation.
- Can VibeVoice generate speech in real time?
The current version is not suited to real-time or low-latency voice applications, such as "real-time deepfakes" in phone calls or video conferencing. It is designed for high-quality offline generation of long audio.
- Is the speech generated by the model watermarked?
Yes. To prevent malicious use, all audio synthesized by the model automatically embeds an audible AI disclaimer (e.g. "This segment was generated by AI") and an imperceptible digital watermark for traceability.