
OmniAvatar is an open source project jointly developed by Zhejiang University and Alibaba that focuses on generating full-body avatar videos from audio input. Built on deep learning, it uses audio and text prompts to produce natural, smooth avatar animation, with particular strength in lip synchronization and full-body motion coordination. OmniAvatar supports video generation for a variety of scenarios, such as podcasts, interactive dialogue, and dynamic performances, and improves lip-sync accuracy and motion naturalness through pixel-level multi-level audio embedding and LoRA training. The project code and model weights are publicly available on GitHub and can be run locally. OmniAvatar is well suited to film, TV, game, and social media content creation, producing high-quality avatar animation.

 

Feature List

  • Audio-Driven Video Generation: Generates full-body avatar animation from the input audio, with lip movements closely synchronized to the audio.
  • Text Prompt Control: Supports controlling the avatar's emotions, movements, and background environment through text prompts.
  • Multilingual Lip Synchronization: Supports lip synchronization in 31 languages, including Chinese, English, and Japanese.
  • Full-Body Coordination: Generates natural shoulder movements, gesture rhythms, and other whole-body animation.
  • Scene Interaction Support: The avatar can interact with objects in the scene, suitable for scenarios such as product demonstrations.
  • Resolution Output: Supports 480p video generation, suitable for different device needs.
  • Open Source Model Support: 1.3B and 14B parameter models are provided to suit different hardware configurations.

Usage Guide

Installation Process

To use OmniAvatar, you need to set up the runtime environment locally and download the pre-trained models. The detailed installation and usage steps are as follows:

  1. Clone the project code
    Run the following command in a terminal to clone the OmniAvatar code repository:

    git clone https://github.com/Omni-Avatar/OmniAvatar.git
    

    Once the cloning is complete, go to the project directory:

    cd OmniAvatar
    
  2. Install dependencies
    The project requires a Python environment and specific dependency libraries. Run the following commands to install PyTorch and the other dependencies:

    pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
    pip install -r requirements.txt
    

    If you have a high-performance GPU, you can optionally install flash_attn to accelerate attention computation:

    pip install flash_attn
    
  3. Download the pre-trained models
    OmniAvatar relies on several pre-trained models, including Wan2.1-T2V-14B, wav2vec2-base-960h, and OmniAvatar-14B. Download them with huggingface-cli:

    mkdir pretrained_models
    pip install "huggingface_hub[cli]"
    huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
    huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
    huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
    

    If hardware resources are limited, you can choose the 1.3B parameter model instead and download it in the same way (example commands are sketched after these installation steps).

  4. Prepare the input file
    Create an input file (e.g. infer_samples.txt) containing the audio file path and a text prompt. Example:

    audio_path: examples/audio/sample.wav
    prompt: "A happy person speaking in a bright room"
    

    Make sure the audio file is in WAV format and that the text prompt clearly describes the character's emotion, action, or setting.

  5. Run the inference script
    Use torchrun to run the inference script and generate the video. For the 14B model:

    torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference.yaml --input_file examples/infer_samples.txt
    

    For the 1.3B model:

    torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference_1.3B.yaml --input_file examples/infer_samples.txt
    

    The output video will be saved in the specified folder (e.g. results).
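
If you use the 1.3B model mentioned in step 3, the downloads follow the same pattern. Note that this is only a sketch: the repository names Wan-AI/Wan2.1-T2V-1.3B and OmniAvatar/OmniAvatar-1.3B below are assumed from the naming of the 14B repositories, so verify them on Hugging Face before downloading:

    huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./pretrained_models/Wan2.1-T2V-1.3B
    huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
    huggingface-cli download OmniAvatar/OmniAvatar-1.3B --local-dir ./pretrained_models/OmniAvatar-1.3B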

Main Features

  • Generate audio-driven video
    The user prepares a reference image and an audio clip. The reference image defines the avatar's appearance, and the audio drives the lip and full-body movements. After the inference script runs, the system generates a synchronized video in which the lip movements closely match the rhythm of the voice. For example, if the user inputs the audio of a speech, OmniAvatar generates the natural gestures and expressions of the character as they speak.
  • Text Prompt Control
    Through text prompts, the user can control the avatar's emotions (e.g., "happy" or "angry"), actions (e.g., "waving"), or background (e.g., "beach"). The prompt should be clear and specific, e.g. "A surprised person dancing in a forest". The system adjusts the details of the animation according to the prompt.
  • Multi-language support
    OmniAvatar uses Wav2Vec2 to extract audio features and supports lip synchronization in 31 languages. Users can input audio in any supported language and the system automatically generates the corresponding lip movements without additional configuration.
  • Scene Interaction
    Add an object-interaction description (e.g. "holding a cup") to the text prompt and the avatar will interact with objects in the scene, which is suitable for e-commerce displays or narrative animation; a combined example entry is shown after this list.
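
The combined entry below is only a sketch in the input-file format from step 4: the audio path is a placeholder and the prompt wording is illustrative. It packs an emotion, an action, a background, and an object interaction into a single prompt:

    audio_path: examples/audio/product_demo.wav
    prompt: "A cheerful person standing in a bright studio, holding a cup and gesturing while speaking"

Running the same torchrun command from step 5 on this file should produce a video in which the avatar's mood, gestures, background, and the held object all follow the prompt.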

Caveats

  • Hardware requirements: The 14B model requires a high-performance GPU (e.g. an A6000), while the 1.3B model is suitable for consumer-grade hardware with 8GB of VRAM.
  • Generation speed: The 14B model takes about 30 seconds per frame on a single GPU; the 1.3B model is faster and better suited to lower-end devices.
  • Output check: After generating a video, check the MP4 file in the output folder to confirm lip synchronization and natural movement (a quick ffprobe check is sketched below).
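
For the output check above, a quick way to confirm the video's resolution and duration is ffprobe. This assumes ffmpeg is installed and that the generated file sits under results/; the file name is a placeholder:

    ffprobe -v error -select_streams v:0 -show_entries stream=width,height,duration -of default=noprint_wrappers=1 results/output.mp4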

Application Scenarios

  1. Podcast video production
    Users can turn podcast audio into avatar videos for added visual appeal. OmniAvatar keeps lip movements synchronized with the audio, making it well suited to quickly producing high-quality podcast content.
  2. Virtual Host Generation
    Social media creators can use OmniAvatar to generate virtual host performance videos, with text prompts controlling mood and setting, for live-streaming or short-video platforms.
  3. Film, TV & Game Animation
    Film, TV, and game developers can use OmniAvatar to quickly generate character animation and reduce traditional animation costs, especially for projects with a large number of dialogue scenes.
  4. E-commerce Product Showcase
    Through the scene interaction function, avatars can display products (e.g. clothing or electronic devices) to enhance the realism of marketing content.

FAQ

  1. What languages does OmniAvatar support for audio input?
    31 languages are supported, including Chinese, English, and Japanese, with lip synchronization handled by the Wav2Vec2 model.
  2. What hardware configuration is required to run it?
    The 1.3B model requires at least 8GB of VRAM; a datacenter-class GPU (e.g. an A6000) is recommended for the 14B model.
  3. What is the resolution of the generated video?
    Currently supports 480p resolution and may be expanded to higher resolutions in the future.
  4. How can I improve my generation speed?
    Use the 1.3B model, or install flash_attn to accelerate attention computation.