Wan2.2-S2V-14B is a large-scale AI model developed by the Wan-AI team that specializes in generating high-quality videos from audio, text, and images. It adopts a Mixture-of-Experts (MoE) architecture with 27B total parameters, of which only 14B are active at runtime, effectively balancing performance and computational cost. The model's core capability is speech-driven generation: it turns input speech, combined with a user-supplied text description and a reference image, into a dynamic video. Wan2.2-S2V-14B pays particular attention to the "cinematic aesthetics" of the generated video, training on curated aesthetic data to reach a higher standard of lighting, composition, and color. In addition, it supports pose control, allowing the user to guide the movements of characters in the generated video with a pose reference video, giving creators a greater degree of freedom.

Function List

  • Speech-driven generation: Using the audio file as the core driver, combined with textual cues and reference images to generate a video synchronized with the audio content.
  • Cinematic aesthetics: The model is trained on curated aesthetic data to produce videos with professional lighting, composition, and color.
  • High-resolution output: Supports generating videos at 480P and 720P resolutions to meet the clarity needs of different scenes.
  • Pose control: The user can provide a video containing a specific action (a pose video), and the model will generate the video following that action sequence, enabling precise control over the character's movements.
  • Mixture-of-Experts (MoE) architecture: Uses an efficient MoE architecture that keeps computational resource consumption relatively low while maintaining strong generation capability.
  • Flexible input combinations: You can use only audio and an image, or add a text description on top, offering a variety of creative combinations.
  • Adaptive video length: When no specific parameters are set, the length of the generated video is automatically adjusted according to the length of the input audio.

Using Help

The Wan2.2-S2V-14B project comes with a detailed installation and usage workflow, allowing users to deploy it quickly and start generating videos.

1. Environment preparation and installation

First, you need to clone the official code repository from GitHub and install the required dependency libraries.

Step 1: Clone the code repository
Open a terminal and execute the following command to download the project code locally:

git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2

Step 2: Install dependencies
The project requires torch version 2.4.0 or later. Next, use pip to install all of the libraries listed in the requirements.txt file.

pip install -r requirements.txt

Note: If the flash_attn package fails to install, try installing all of the other packages first, and then install flash_attn separately at the end.
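One possible way to apply this workaround from the shell is sketched below; the intermediate file name requirements-no-flash.txt is purely illustrative, and it assumes the dependency appears as flash_attn in requirements.txt:

# Install every dependency except flash_attn first
grep -v "flash_attn" requirements.txt > requirements-no-flash.txt
pip install -r requirements-no-flash.txt
# Then install flash_attn on its own; building it can take a while
pip install flash_attn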

2. Model downloads

The model files can be downloaded with either huggingface-cli or modelscope-cli.

Download with the Hugging Face CLI (requires huggingface_hub):

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-S2V-14B --local-dir ./Wan2.2-S2V-14B

Download with the ModelScope CLI (requires modelscope):

pip install modelscope
modelscope download Wan-AI/Wan2.2-S2V-14B --local_dir ./Wan2.2-S2V-14B

After the command finishes, the model weights and other related files will be downloaded into the Wan2.2-S2V-14B folder in the current directory.
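As a quick sanity check, you can list the folder and its total size to confirm the download completed (the exact file names depend on the release):

ls ./Wan2.2-S2V-14B
du -sh ./Wan2.2-S2V-14B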

3. Generating videos: processes and commands

The model supports multiple modes of video generation, including single-GPU inference and multi-GPU distributed inference.

Scenario 1: Basic speech-to-video generation (single GPU)

This is the most basic way to use the model and is suitable for users with sufficient video memory (the official guidance is at least 80GB of VRAM).

Command format:

python generate.py --task s2v-14B --size 1024*704 --ckpt_dir ./Wan2.2-S2V-14B/ --offload_model True --convert_model_dtype --prompt "a text description" --image "path/to/reference/image" --audio "path/to/audio/file"

Parameter details:

  • --task s2v-14B: Specifies the speech-to-video task.
  • --size 1024*704: Sets the resolution of the generated video. The aspect ratio is automatically adjusted according to the input reference image.
  • --ckpt_dir ./Wan2.2-S2V-14B/: Specifies the path to the downloaded model files.
  • --offload_model True: Offloads some model components to the CPU to conserve video memory.
  • --convert_model_dtype: Converts the model parameter data type to optimize performance.
  • --prompt "...": A text prompt describing the style, content, or subject of the video, for example "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard."
  • --image "...": The path to a reference image, for example "./examples/i2v_input.JPG". The model bases the style and subject of the video on this image.
  • --audio "...": The path to the audio file that drives the video generation, for example "./examples/talk.wav".

Example:

python generate.py --task s2v-14B --size 1024*704 --ckpt_dir ./Wan2.2-S2V-14B/ --offload_model True --convert_model_dtype --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard." --image "examples/i2v_input.JPG" --audio "examples/talk.wav"
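The same flags also lend themselves to simple batch processing. Below is a minimal sketch that drives several audio clips with the same reference image and prompt; the clips/ directory is purely illustrative and not part of the official tooling:

# Generate one video per audio clip found in clips/ (illustrative sketch)
for AUDIO in clips/*.wav; do
  python generate.py --task s2v-14B --size "1024*704" --ckpt_dir ./Wan2.2-S2V-14B/ --offload_model True --convert_model_dtype --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard." --image "examples/i2v_input.JPG" --audio "$AUDIO"
done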

Scenario 2: Pose-driven speech-to-video generation

If you want the character or subject in the generated video to follow a specific motion, you can use the pose-driven feature.

Command format:

torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 1024*704 --ckpt_dir ./Wan2.2-S2V-14B/ --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "a text description" --image "path/to/reference/image" --audio "path/to/audio/file" --pose_video "path/to/pose/video"

New parameter:

  • --pose_video "...": The path to a pose reference video, for example "./examples/pose.mp4". The model extracts the action sequence from this video and applies it to the newly generated video.

Example:

torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 1024*704 --ckpt_dir ./Wan2.2-S2V-14B/ --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "a person is singing" --image "examples/pose.png" --audio "examples/sing.MP3" --pose_video "./examples/pose.mp4"

This command is usually run in a multi-GPU environment for better performance.
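If only a single GPU is available, the pose flag can in principle be combined with the single-GPU options from Scenario 1. This is an untested sketch assembled from the flags documented above, not an officially documented command:

python generate.py --task s2v-14B --size 1024*704 --ckpt_dir ./Wan2.2-S2V-14B/ --offload_model True --convert_model_dtype --prompt "a person is singing" --image "examples/pose.png" --audio "examples/sing.MP3" --pose_video "./examples/pose.mp4"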

Application Scenarios

  1. Digital humans and virtual anchors
    Based on pre-recorded audio or real-time voice input, the model can generate a virtual anchor with synchronized lip movements and natural expressions, whose motion can be controlled through a pose video. This is widely applicable to live streaming, online education, and news broadcasting.
  2. Automated video content production
    Text content such as blog posts, press releases, or novels, paired with suitable background music or narration, can be automatically converted into videos, greatly improving content creation efficiency for social media, advertising, and marketing.
  3. Music Video (MV) Creation
    Music creators can input their own songs and provide reference images and text descriptions that match the mood of the song to quickly generate music videos with an artistic feel, providing a low-cost MV production solution for independent musicians.
  4. Personalized Audiobooks
    Audio narration for children's stories, combined with illustration-style reference drawings, generates vivid animated story videos. Parents or educational institutions can easily create customized visual reading materials for children.

QA

  1. What are the hardware requirements to run this model?
    Running the 14B-parameter model on a single GPU requires at least 80GB of video memory (VRAM). For users with insufficient video memory, the official recommendation is a multi-GPU configuration to spread the computational load.
  2. How is the length of the generated video determined?
    By default, the model automatically adjusts the length of the generated video to the length of the input audio file. To quickly preview the output or generate a clip of a specific length, you can set the --num_clip parameter to control the number of video clips generated (see the sketch after this list).
  3. Do I have to provide text, images, and audio at the same time?
    No. The core driver of the model is the audio, but the inputs can be combined flexibly. The most common usage combines audio with a reference image; the text prompt is optional and is used to further guide the style and content of the generated video.
  4. What kind of video does the pose control feature support?
    Pose control is implemented through the --pose_video parameter, which recognizes the action sequence of a person or object in the input video. In principle, any video containing clear motion can be used as input, and the model will try to reproduce those actions in the generated video.
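As mentioned in question 2, a quick preview can be limited to a small number of clips via --num_clip. A minimal sketch, assuming the parameter takes an integer number of clips (the paths and prompt here are placeholders):

python generate.py --task s2v-14B --size 1024*704 --ckpt_dir ./Wan2.2-S2V-14B/ --offload_model True --convert_model_dtype --num_clip 1 --prompt "a text description" --image "path/to/reference.jpg" --audio "path/to/audio.wav"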