
OmniAvatar is an open source project jointly developed by Zhejiang University and Alibaba that focuses on generating full-body avatar videos from audio input. Built on deep learning, it uses audio and text prompts to produce natural, smooth avatar animation, with particular strength in lip synchronization and full-body motion coordination. OmniAvatar supports video generation for a variety of scenarios, such as podcasts, interactive dialogue, and dynamic performances, and improves lip-sync accuracy and motion naturalness through pixel-level multi-level audio embedding and LoRA training. The project code and model weights are publicly available on GitHub and can be run locally. OmniAvatar is well suited to film, TV, game, and social media content creation, producing high-quality avatar animation.

 

Feature List

  • Audio-Driven Video Generation: Generates full-body avatar animation from the input audio, with lip movements closely synchronized to the audio.
  • Text Prompt Control: Supports controlling the avatar's emotions, movements, and background environment through text prompts.
  • Multilingual Lip Synchronization: Supports lip synchronization in 31 languages, including Chinese, English, and Japanese.
  • Full-Body Coordination: Generates natural shoulder movements, gesture rhythms, and other whole-body animation.
  • Scene Interaction Support: The avatar can interact with objects in the scene, suitable for scenarios such as product demonstrations.
  • Resolution Output: Supports 480p video generation, suitable for different device needs.
  • Open Source Model Support: 1.3B and 14B parameter models are provided to suit different hardware configurations.

Usage Guide

Installation Process

To use OmniAvatar, you need to set up the runtime environment locally and download the pre-trained models. The detailed installation and usage steps are as follows:

  1. Clone the project code
    Run the following command in a terminal to clone the OmniAvatar code repository:

    git clone https://github.com/Omni-Avatar/OmniAvatar.git
    

    Once the cloning is complete, go to the project directory:

    cd OmniAvatar
    
  2. Install dependencies
    The project requires a Python environment and specific dependency libraries. Run the following commands to install PyTorch and the other dependencies:

    pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
    pip install -r requirements.txt
    

    If you have a high-performance GPU, you can optionally install flash_attn to accelerate attention computation:

    pip install flash_attn
    
  3. Download the pre-trained models
    OmniAvatar relies on several pre-trained models, including Wan2.1-T2V-14B, wav2vec2-base-960h, and OmniAvatar-14B. Download them with huggingface-cli:

    mkdir pretrained_models
    pip install "huggingface_hub[cli]"
    huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
    huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
    huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
    

    If hardware resources are limited, you can choose the 1.3B parameter model instead and download it in the same way (example commands are sketched after these installation steps).

  4. Prepare the input file
    Create an input file (e.g. infer_samples.txt) containing the audio file path and a text prompt. Example:

    audio_path: examples/audio/sample.wav
    prompt: "A happy person speaking in a bright room"
    

    Make sure the audio file is in WAV format and that the text prompt clearly describes the character's emotion, action, or setting.

  5. Run the inference script
    Use torchrun to run the inference script and generate the video. For the 14B model:

    torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference.yaml --input_file examples/infer_samples.txt
    

    For the 1.3B model:

    torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference_1.3B.yaml --input_file examples/infer_samples.txt
    

    The output video will be saved in the specified folder (e.g. results).
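
If you use the 1.3B model mentioned in step 3, the downloads follow the same pattern. Note that this is only a sketch: the repository names Wan-AI/Wan2.1-T2V-1.3B and OmniAvatar/OmniAvatar-1.3B below are assumed from the naming of the 14B repositories, so verify them on Hugging Face before downloading:

    huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./pretrained_models/Wan2.1-T2V-1.3B
    huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
    huggingface-cli download OmniAvatar/OmniAvatar-1.3B --local-dir ./pretrained_models/OmniAvatar-1.3B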

Main Features

  • Generate audio-driven video
    The user prepares a reference image and an audio clip. The reference image defines the avatar's appearance, and the audio drives the lip and full-body movements. After the inference script runs, the system generates a synchronized video in which the lip movements closely match the rhythm of the voice. For example, if the user inputs the audio of a speech, OmniAvatar generates the natural gestures and expressions of the character as they speak.
  • Text Prompt Control
    Through text prompts, the user can control the avatar's emotions (e.g., "happy" or "angry"), actions (e.g., "waving"), or background (e.g., "beach"). The prompt should be clear and specific, e.g. "A surprised person dancing in a forest". The system adjusts the details of the animation according to the prompt.
  • Multi-language support
    OmniAvatar uses Wav2Vec2 to extract audio features and supports lip synchronization in 31 languages. Users can input audio in any supported language and the system automatically generates the corresponding lip movements without additional configuration.
  • Scene Interaction
    Add an object-interaction description (e.g. "holding a cup") to the text prompt and the avatar will interact with objects in the scene, which is suitable for e-commerce displays or narrative animation; a combined example entry is shown after this list.
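
The combined entry below is only a sketch in the input-file format from step 4: the audio path is a placeholder and the prompt wording is illustrative. It packs an emotion, an action, a background, and an object interaction into a single prompt:

    audio_path: examples/audio/product_demo.wav
    prompt: "A cheerful person standing in a bright studio, holding a cup and gesturing while speaking"

Running the same torchrun command from step 5 on this file should produce a video in which the avatar's mood, gestures, background, and the held object all follow the prompt.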

Caveats

  • Hardware requirements: The 14B model requires a high-performance GPU (e.g. an A6000), while the 1.3B model is suitable for consumer-grade hardware with 8GB of VRAM.
  • Generation speed: The 14B model takes about 30 seconds per frame on a single GPU; the 1.3B model is faster and better suited to lower-end devices.
  • Output check: After generating a video, check the MP4 file in the output folder to confirm lip synchronization and natural movement (a quick ffprobe check is sketched below).
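
For the output check above, a quick way to confirm the video's resolution and duration is ffprobe. This assumes ffmpeg is installed and that the generated file sits under results/; the file name is a placeholder:

    ffprobe -v error -select_streams v:0 -show_entries stream=width,height,duration -of default=noprint_wrappers=1 results/output.mp4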

Application Scenarios

  1. Podcast video production
    Users can turn podcast audio into avatar videos for added visual appeal. OmniAvatar keeps lip movements synchronized with the audio, making it well suited to quickly producing high-quality podcast content.
  2. Virtual Host Generation
    Social media creators can use OmniAvatar to generate virtual host performance videos, with text prompts controlling mood and setting, for live-streaming or short-video platforms.
  3. Film, TV & Game Animation
    Film, TV, and game developers can use OmniAvatar to quickly generate character animation and reduce traditional animation costs, especially for projects with a large number of dialogue scenes.
  4. E-commerce Product Showcase
    Through the scene interaction function, avatars can display products (e.g. clothing or electronic devices) to enhance the realism of marketing content.

FAQ

  1. What languages does OmniAvatar support for audio input?
    31 languages are supported, including Chinese, English, and Japanese, with lip synchronization handled by the Wav2Vec2 model.
  2. What hardware configuration is required to run it?
    The 1.3B model requires at least 8GB of VRAM; a datacenter-class GPU (e.g. an A6000) is recommended for the 14B model.
  3. What is the resolution of the generated video?
    Currently supports 480p resolution and may be expanded to higher resolutions in the future.
  4. How can I improve my generation speed?
    Use the 1.3B model, or install flash_attn to accelerate attention computation.