Diffuman4D is a project developed by the ZJU3DV research team at Zhejiang University, focused on generating high-fidelity 4D human views from sparse-view videos. The project combines a spatio-temporal diffusion model with 4D Gaussian Splatting (4DGS) to address the difficulty traditional methods have in producing high-quality novel views from sparse input. It generates multi-view-consistent videos and, together with the input videos, reconstructs a high-resolution (1024p) 4D model that supports real-time free-viewpoint rendering. The project suits scenarios that require high-precision human motion capture and rendering, such as virtual reality and animation production. The code and models are open-sourced on GitHub, and the research has been accepted to ICCV 2025.
Function List
- Generate spatio-temporally consistent multi-view videos from sparse-view input videos.
- Reconstruct a high-fidelity 4DGS model from the generated and input videos.
- Support real-time free-viewpoint rendering of complex costumes and dynamic movements.
- Provide Skeleton-Plücker conditional encoding to enhance the consistency of generated videos.
- Use LongVolcap for 4DGS reconstruction to optimize rendering quality.
- Open-source code and models for researchers and developers.
Usage Guide
Installation process
- Environment preparation
Ensure that Python 3.8 or later is installed on your system. A virtual environment is recommended to avoid dependency conflicts. You can create and activate one with the following commands:
python -m venv diffuman4d_env
source diffuman4d_env/bin/activate  # Linux/Mac
diffuman4d_env\Scripts\activate  # Windows
- Clone the repository
Run the following commands in a terminal to download the Diffuman4D code:
git clone https://github.com/zju3dv/Diffuman4D.git
cd Diffuman4D
- Install dependencies
Project dependencies include PyTorch, NumPy, OpenCV, and other libraries. Run the following command to install them all:
pip install -r requirements.txt
If GPU support is required, make sure the installed PyTorch build is compatible with your CUDA version; the latest build can be installed with:
pip install torch torchvision
A quick check of the GPU setup is shown after this list.
- Download the pre-trained models
The pre-trained models are available from the GitHub release page or the link specified in the official documentation. After downloading, unzip the model files into the pretrained_models folder in the project root directory.
- Verify the installation
Run the sample script to check that the environment is configured correctly:
python scripts/test_setup.py
If no error is reported, the environment is configured successfully.
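If GPU support was installed, a quick sanity check like the one below can confirm that PyTorch sees the GPU before running the heavier pipelines. This is a generic PyTorch snippet, not one of the project's scripts:

```python
# check_gpu.py -- generic PyTorch sanity check (not a Diffuman4D script)
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    # Report the CUDA toolkit PyTorch was built against and the detected GPU.
    print("CUDA version:   ", torch.version.cuda)
    print("GPU device:     ", torch.cuda.get_device_name(0))
```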
Usage
1. Data preparation
- Input videos: Prepare at least two videos captured from sparse viewpoints. A resolution of 720p or above is recommended, and MP4 and AVI formats are supported. The videos should contain human motion, and the background should be as simple as possible to minimize interference.
- Skeleton data: The project uses Skeleton-Plücker conditional encoding and therefore requires skeleton data, which can be extracted with OpenPose or MediaPipe. The skeleton data is stored in JSON format and contains keypoint coordinates and timestamps (see the sketch after this list for one possible layout).
- Storage path: Place the input videos and skeleton data in the data/input folder of the project directory, and make sure the file names match the configuration file.
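The repository documentation defines the exact skeleton JSON schema; the field names below (fps, frames, timestamp, keypoints) are assumptions for illustration only. A minimal sketch of writing per-frame keypoints in such a layout:

```python
# make_skeleton_json.py -- illustrative only; the actual Diffuman4D schema
# may differ (the field names below are assumptions, not the official format).
import json

def save_skeleton(keypoints_per_frame, fps, path):
    """keypoints_per_frame: list of [[x, y, confidence], ...] per frame,
    e.g. 2D keypoints exported from OpenPose or MediaPipe."""
    frames = []
    for i, kps in enumerate(keypoints_per_frame):
        frames.append({
            "timestamp": i / fps,   # seconds since the start of the clip
            "keypoints": kps,       # one [x, y, confidence] triple per joint
        })
    with open(path, "w") as f:
        json.dump({"fps": fps, "frames": frames}, f, indent=2)

if __name__ == "__main__":
    # Two dummy frames with three joints each, just to show the structure.
    dummy = [[[100.0, 200.0, 0.9], [110.0, 210.0, 0.8], [120.0, 220.0, 0.7]]] * 2
    save_skeleton(dummy, fps=30, path="data/input/skeleton.json")
```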
2. Generation of multi-view videos
- Run the generation script to invoke the spatio-temporal diffusion model and generate multi-view-consistent videos:
python scripts/generate_views.py --input_dir data/input --output_dir data/output --model_path pretrained_models/diffuman4d.pth
- Parameter Description:
  - --input_dir: folder containing the input videos and skeleton data.
  - --output_dir: save path for the generated videos.
  - --model_path: path to the pre-trained model.
- The generated videos are saved in the data/output folder at 1024p resolution and are multi-view consistent.
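To script this step (for example, to process several captures in a row), the documented CLI can be called from Python. The snippet below is just a thin subprocess wrapper around the command shown above, not part of the project itself:

```python
# run_generation.py -- thin wrapper around the documented CLI (illustrative)
import subprocess
from pathlib import Path

def generate_views(input_dir, output_dir, model_path="pretrained_models/diffuman4d.pth"):
    """Invoke scripts/generate_views.py with the flags documented above."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "python", "scripts/generate_views.py",
            "--input_dir", str(input_dir),
            "--output_dir", str(output_dir),
            "--model_path", model_path,
        ],
        check=True,  # raise if the generation script exits with an error
    )

if __name__ == "__main__":
    generate_views("data/input", "data/output")
```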
3. Reconstruction of the 4DGS model
- The input and generated videos are reconstructed into a 4DGS model using LongVolcap:
python scripts/reconstruct_4dgs.py --input_dir data/input --generated_dir data/output --output_model models/4dgs_output.ply
- Parameter Description:
  - --input_dir: path to the original input videos.
  - --generated_dir: path to the generated videos.
  - --output_model: path of the output 4DGS model file.
- The generated model supports real-time rendering and can be viewed in a 4DGS-enabled rendering engine such as Unity or Unreal Engine.
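Before importing the result into a rendering engine, the output .ply file can be sanity-checked with a generic PLY reader such as the plyfile package (pip install plyfile). This is not a project utility, and the exact attributes stored in a 4DGS .ply depend on the reconstruction code:

```python
# inspect_ply.py -- generic PLY inspection with the `plyfile` package.
# Illustrative only: the attributes stored by the 4DGS reconstruction are project-specific.
from plyfile import PlyData

ply = PlyData.read("models/4dgs_output.ply")
for element in ply.elements:
    # Print each element (e.g. "vertex") with its count and stored properties.
    props = ", ".join(p.name for p in element.properties)
    print(f"{element.name}: {element.count} items [{props}]")
```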
4. Real-time rendering
- Import the generated 4DGS model into a rendering engine and adjust the camera to achieve free-viewpoint rendering. A high-performance GPU (e.g., NVIDIA RTX series) is recommended to ensure smooth playback.
- The project provides a sample script, render_example.py, which can be run directly to preview the rendering:
python scripts/render_example.py --model_path models/4dgs_output.ply
5. Special functions
- Skeleton-Plücker encoding: enhances the spatio-temporal consistency of the generated videos using skeleton data and Plücker coordinates (a sketch of the general Plücker parameterization follows this list). Specify the skeleton data path and target viewpoint parameters in the configuration file config.yaml:
  skeleton_path: data/input/skeleton.json
  target_views: [0, 45, 90, 135]
- High-fidelity rendering: 4DGS models support rendering of complex costumes and dynamic movements. Users can adjust lighting and material parameters during rendering to optimize the visual result.
- Open-source resources: The project provides detailed documentation and example datasets, located in the docs/ and data/example/ folders, for quick onboarding.
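For readers unfamiliar with Plücker coordinates: a camera ray with origin o and unit direction d is commonly encoded as the 6-vector (d, o × d), which ray-conditioned diffusion models typically use as a per-pixel embedding. The snippet below illustrates that general parameterization only; it is not the project's actual encoder, and the camera conventions here are assumptions:

```python
# plucker_rays.py -- generic Plücker ray parameterization (illustrative; not
# the project's encoder, and the camera conventions here are assumptions).
import numpy as np

def plucker_rays(K, c2w, height, width):
    """Return an (H, W, 6) array of Plücker coordinates (d, o x d) per pixel.

    K   : (3, 3) camera intrinsics.
    c2w : (4, 4) camera-to-world extrinsics.
    """
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)            # (H, W, 3)

    # Unproject to camera-space directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # All rays share the camera center as their origin.
    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    moment = np.cross(origin, dirs_world)                       # o x d
    return np.concatenate([dirs_world, moment], axis=-1)        # (H, W, 6)
```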
Caveats
- Hardware requirements: The generation and reconstruction process requires at least 16 GB of system RAM and a GPU with 8 GB of VRAM. An NVIDIA GPU is recommended for optimal performance.
- Data quality: The quality of the input video directly affects the generated results, and it is recommended to use clear, unobstructed videos.
- Debugging support: If problems are encountered, refer to docs/troubleshooting.md or submit a GitHub Issue.
Application Scenarios
- Virtual reality and game development
Diffuman4D generates high-fidelity 4D human models for VR games or virtual character creation. Developers only need to provide a few phone-captured videos to produce dynamic characters that can be rendered from different viewpoints, reducing the cost of specialized equipment.
- Film and animation production
Animators can use Diffuman4D to generate high-quality motion sequences from a small amount of video for rendering virtual characters in film or animation, especially for scenes requiring complex costumes or dynamic movement.
- Motion capture research
Researchers can use Diffuman4D to conduct 4D reconstruction experiments and explore sparse-view human body modeling. The open-source code supports secondary development and is suitable for academic research.
- Education and training
In dance or physical education, Diffuman4D generates multi-view videos of movements, helping students see motion details from different perspectives and improving teaching effectiveness.
QA
- What input video formats does Diffuman4D support?
Common video formats such as MP4 and AVI are supported. A resolution of 720p or above and a frame rate of 24-30 fps are recommended.
- How long does it take to generate a video?
It depends on hardware performance and input video length. On an NVIDIA RTX 3090, generating a 10-second multi-view video takes about 5-10 minutes.
- Is specialized equipment required?
No. Diffuman4D is designed to generate high-quality models from ordinary phone videos without specialized motion-capture equipment.
- How can the generated results be optimized?
Provide clear input videos, reduce background interference, and ensure accurate skeleton data. Adjusting the viewpoint parameters in the configuration file can also improve consistency.