LatentSync is a professional-grade AI tool developed by ByteDance, built on Stable Diffusion's latent diffusion model. The tool innovatively combines Whisper audio feature extraction with a U-Net network architecture to convert audio directly into video frames. Its technical implementation consists of three core steps:
- Phoneme-level features are first extracted from the audio by a Whisper model
- The audio features are then mapped into the latent space of the video frames by a modified U-Net network
- Finally, a Stable Diffusion sampler generates a temporally continuous video sequence
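The three steps above can be sketched as a minimal toy pipeline. This is a hypothetical illustration of the data flow only, not LatentSync's real implementation: the function names, feature dimensions, latent shapes, and the simplified denoising update are all assumptions made for the example.

```python
import numpy as np

def extract_audio_features(audio: np.ndarray, n_frames: int, dim: int = 384) -> np.ndarray:
    """Placeholder for Whisper feature extraction: one feature vector per video frame."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, dim))  # stand-in for real phoneme features

def unet_denoise(latent: np.ndarray, audio_feat: np.ndarray, t: int) -> np.ndarray:
    """Placeholder for the audio-conditioned U-Net predicting noise at step t.

    A real model would cross-attend to audio_feat; here we only mix in a
    scalar summary of it to show where the conditioning enters.
    """
    cond = np.tanh(audio_feat.mean())
    return latent * 0.1 + cond * 0.01

def sample_video_latents(audio: np.ndarray, n_frames: int = 8,
                         latent_shape: tuple = (4, 8, 8), steps: int = 10) -> np.ndarray:
    """Run the three-stage flow: audio features -> conditioned denoising -> latents."""
    feats = extract_audio_features(audio, n_frames)
    rng = np.random.default_rng(1)
    # Start from pure noise in the video-frame latent space.
    latents = rng.standard_normal((n_frames,) + latent_shape)
    for t in range(steps, 0, -1):
        for i in range(n_frames):
            eps = unet_denoise(latents[i], feats[i], t)
            latents[i] = latents[i] - eps / steps  # highly simplified sampler update
    return latents

video_latents = sample_video_latents(np.zeros(16000))
print(video_latents.shape)  # (8, 4, 8, 8): one latent per generated video frame
```

In the real system the latents would then be decoded back to pixel frames by the diffusion model's VAE decoder; the sketch stops at the latent stage to keep the example small.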
This technical route breaks away from traditional 3D-modeling-based lip-synchronization methods and achieves a more natural look. Version 1.5 also introduces TREPA temporal optimization, which significantly improves the temporal consistency of the generated video.
This answer comes from the article "LatentSync: an open source tool for generating lip-synchronized video directly from audio".