Solution
Diffuman4D tackles this problem by combining spatio-temporal diffusion models with 4D Gaussian Splatting (4DGS). The pipeline has three stages: first, Skeleton-Plücker conditional encoding enforces spatio-temporal consistency, letting a pre-trained diffusion model turn sparse-view video input (as few as 2 views) into multi-view-consistent high-definition (1024p) videos; second, the LongVolcap optimization algorithm reconstructs a high-fidelity 4DGS model from the generated videos combined with the original inputs; finally, a real-time rendering engine enables free-viewpoint playback.
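For readers unfamiliar with the Plücker half of the condition: each pixel's viewing ray, with origin o and direction d, is commonly encoded as the 6D Plücker embedding (d, o × d), giving the diffusion model a dense, pose-aware camera encoding. The sketch below computes such a map for a pinhole camera; the function name and array layout are illustrative assumptions, not taken from the Diffuman4D codebase.

```python
# Illustrative sketch: per-pixel Plücker ray embedding for a pinhole camera.
# Names and conventions are assumptions, not the Diffuman4D implementation.
import numpy as np

def plucker_embedding(K: np.ndarray, c2w: np.ndarray, h: int, w: int) -> np.ndarray:
    """Return an (h, w, 6) Plücker map: (direction, origin x direction).

    K   -- 3x3 camera intrinsics
    c2w -- 4x4 camera-to-world pose
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (h, w, 3)
    # Unproject to camera-space ray directions, then rotate to world space.
    dirs = pix @ np.linalg.inv(K).T @ c2w[:3, :3].T           # (h, w, 3)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)          # camera center
    moment = np.cross(origin, dirs)                           # o x d
    return np.concatenate([dirs, moment], axis=-1)            # (h, w, 6)
```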
Implementation steps
- Prepare at least 2 input videos at 720p resolution or higher; a clean background is recommended
- Extract skeleton data with MediaPipe/OpenPose and save it as JSON (see the sketch after this list)
- Run the generate_views.py script to generate multi-view videos
- Reconstruct the 4DGS model with reconstruct_4dgs.py
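As a concrete example of the skeleton-extraction step, here is a minimal sketch using MediaPipe Pose that dumps per-frame landmarks to JSON. The file names and JSON schema are assumptions; the article does not specify the exact format generate_views.py expects.

```python
# Sketch: extract per-frame skeletons with MediaPipe Pose and save as JSON.
# JSON layout and file names are hypothetical, not from the article.
import json
import cv2
import mediapipe as mp

def extract_skeleton(video_path: str, out_path: str) -> None:
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            frames.append([
                {"x": lm.x, "y": lm.y, "z": lm.z, "visibility": lm.visibility}
                for lm in result.pose_landmarks.landmark
            ])
        else:
            frames.append(None)  # no person detected in this frame
    cap.release()
    pose.close()
    with open(out_path, "w") as f:
        json.dump({"video": video_path, "frames": frames}, f)

if __name__ == "__main__":
    extract_skeleton("cam0.mp4", "cam0_skeleton.json")
```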
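And as a rough sketch of how the two scripts might be chained, assuming hypothetical command-line flags (the article names the scripts but not their arguments):

```python
# Hypothetical driver for the two-script pipeline; all flags are assumptions.
import subprocess

input_videos = ["cam0.mp4", "cam1.mp4"]                 # sparse views (>= 2)
skeletons = ["cam0_skeleton.json", "cam1_skeleton.json"]

# Stage 1: generate multi-view-consistent 1024p videos with the diffusion model.
subprocess.run(["python", "generate_views.py",
                "--videos", *input_videos,
                "--skeletons", *skeletons,
                "--out", "generated/"], check=True)

# Stage 2: reconstruct the 4DGS model from generated + original views.
subprocess.run(["python", "reconstruct_4dgs.py",
                "--generated", "generated/",
                "--inputs", *input_videos,
                "--out", "model_4dgs/"], check=True)
```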
Caveats
An NVIDIA RTX graphics card with 8 GB of VRAM or more is recommended. Input videos of 10-30 seconds work best, and scenes with complex motion require more accurate skeleton data.
This answer is based on the article "Diffuman4D: Generating High-Fidelity 4D Human Body Views from Sparse Video".