LatentSync 1.5 was released in March 2025 with several optimizations aimed at Chinese-language content. The most significant is a drop in the VRAM required for training from more than 30 GB in earlier versions to 20 GB, which makes it possible to train the model on an RTX 3090-class graphics card.
- The memory savings come mainly from an improved U-Net architecture, including the use of the stage2_efficient.yaml configuration.
- In the inference phase, the VRAM requirement drops further, to only 6.8 GB.
- This version also improves recognition of Chinese phonemes and encodes Chinese audio more efficiently through a redesigned data processing pipeline.
These improvements allow ordinary developers to use the tool to process Chinese content on consumer-grade hardware, significantly lowering the technical barrier.
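As a rough illustration of what these figures mean in practice, the sketch below compares a GPU's total VRAM with the 20 GB (training) and 6.8 GB (inference) requirements quoted above. The helper names `feasible_tasks` and `detect_vram_gb` are hypothetical and not part of LatentSync; the detection step assumes PyTorch is installed and falls back to `None` otherwise. Total VRAM is only an upper bound, since other processes may already occupy part of it.

```python
from typing import Dict, Optional

# Requirements quoted in the text above for LatentSync 1.5 (in GB).
TRAIN_VRAM_GB = 20.0
INFER_VRAM_GB = 6.8


def feasible_tasks(total_vram_gb: float) -> Dict[str, bool]:
    """Compare a card's total VRAM with the quoted requirements."""
    return {
        "training": total_vram_gb >= TRAIN_VRAM_GB,
        "inference": total_vram_gb >= INFER_VRAM_GB,
    }


def detect_vram_gb(device_index: int = 0) -> Optional[float]:
    """Return the total VRAM of a CUDA device in GB, or None if unavailable."""
    try:
        import torch  # assumption: PyTorch is installed
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1024**3


if __name__ == "__main__":
    vram = detect_vram_gb()
    if vram is None:
        print("No CUDA device detected; call feasible_tasks() with a value manually.")
    else:
        print(f"{vram:.1f} GB total:", feasible_tasks(vram))
```

Under these thresholds, a 24 GB card such as the RTX 3090 passes both checks, while an 8 GB card is sufficient for inference only.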
This answer comes from the article "LatentSync: an open source tool for generating lip-synchronized video directly from audio".