LatentSync provides a specialized data preprocessing pipeline to ensure that input videos meet the model's requirements. The pipeline applies a multi-level quality-check mechanism:
- Segment scenes with PySceneDetect, retaining valid clips of 5-10 seconds
- Detect and align face regions with the face-alignment library, uniformly resizing them to 256×256 resolution
- Compute audio-video synchronization scores with SyncNet, filtering out samples scoring below 3
- Assess visual quality with hyperIQA, removing low-quality content scoring below 40
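The filtering stages above can be sketched as a simple pipeline. This is a minimal illustration, not the project's actual code: the `Segment` class, threshold names, and scoring fields are hypothetical stand-ins for the outputs of PySceneDetect, SyncNet, and hyperIQA.

```python
from dataclasses import dataclass

# Thresholds taken from the pipeline described above.
MIN_SEGMENT_SEC = 5.0
MAX_SEGMENT_SEC = 10.0
MIN_SYNC_SCORE = 3.0   # SyncNet confidence cutoff
MIN_IQA_SCORE = 40.0   # hyperIQA quality cutoff

@dataclass
class Segment:
    """Hypothetical record for one scene-detected clip."""
    path: str
    duration: float      # seconds, from scene detection
    sync_score: float    # placeholder for a SyncNet score
    iqa_score: float     # placeholder for a hyperIQA score

def passes_quality_checks(seg: Segment) -> bool:
    """Apply the multi-level filters: duration, sync, visual quality."""
    if not (MIN_SEGMENT_SEC <= seg.duration <= MAX_SEGMENT_SEC):
        return False
    if seg.sync_score < MIN_SYNC_SCORE:
        return False
    if seg.iqa_score < MIN_IQA_SCORE:
        return False
    return True

segments = [
    Segment("clip_a.mp4", 7.2, sync_score=5.1, iqa_score=55.0),  # passes
    Segment("clip_b.mp4", 3.0, sync_score=6.0, iqa_score=60.0),  # too short
    Segment("clip_c.mp4", 8.0, sync_score=2.1, iqa_score=70.0),  # poor sync
    Segment("clip_d.mp4", 9.5, sync_score=4.0, iqa_score=30.0),  # low quality
]

kept = [s.path for s in segments if passes_quality_checks(s)]
print(kept)  # → ['clip_a.mp4']
```

In the real pipeline the face-alignment and 256×256 resize step would run on each retained clip before the sync and quality scoring; it is omitted here since it transforms frames rather than filtering clips.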
This process not only ensures the quality of the training data but also serves as the reference standard for input preprocessing at inference time. The project officially recommends that users process their own data to the same standard before use, which is key to obtaining the desired results.
This answer is based on the article "LatentSync: an open source tool for generating lip-synchronized video directly from audio".