Step3 offers two solutions to cope with GPU memory limitations:
- Use the optimized block-fp8 model weights, which significantly reduce the memory footprint compared to traditional bf16 weights (a rough size comparison is sketched after this list).
- Adopt a Mixture-of-Experts (MoE) architecture, which reduces computational overhead by activating only a subset of the experts (38B active parameters).
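As a rough illustration of why block-fp8 nearly halves the weight footprint, here is a back-of-the-envelope estimate. The 321B total-parameter figure is an assumption based on Step3's reported size, and the calculation ignores activations, the KV cache, and block-fp8's small per-block scale overhead:

```python
# Back-of-the-envelope weight-only memory comparison for the two formats.
# 321B total parameters is an assumption about Step3's reported size.

def weight_memory_gb(total_params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 billion params * 1 byte ~= 1 GB)."""
    return total_params_billions * bytes_per_param

print(f"bf16:      ~{weight_memory_gb(321, 2.0):.0f} GB")  # 2 bytes per parameter
print(f"block-fp8: ~{weight_memory_gb(321, 1.0):.0f} GB")  # 1 byte per parameter
```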
Implementation: download the block-fp8 weights from Hugging Face and deploy them with the vLLM inference engine. For A800/H800 GPUs with 80GB of memory, 4-card tensor parallelism is recommended, which keeps memory consumption under 60GB per card. If hardware is limited, you can reduce the max_new_tokens parameter (e.g., set it to 512) to ease the computational pressure. A deployment sketch follows.
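Below is a minimal sketch using vLLM's offline Python API. The Hugging Face repo id `stepfun-ai/step3` and the sampling settings are assumptions, so check the model card for the exact id and any recommended flags. Note that vLLM's `SamplingParams` calls the generation budget `max_tokens`, the counterpart of `max_new_tokens` in transformers:

```python
from vllm import LLM, SamplingParams

# Load the block-fp8 checkpoint from Hugging Face (repo id assumed; verify
# against the model card) and shard it across 4 A800/H800 80GB cards.
llm = LLM(
    model="stepfun-ai/step3",
    tensor_parallel_size=4,
)

# Capping the generation budget at 512 tokens trims KV-cache growth and
# compute pressure on constrained hardware.
params = SamplingParams(max_tokens=512, temperature=0.7)

outputs = llm.generate(["Summarize the Step3 release in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Alternatively, an OpenAI-compatible endpoint can be started from the command line with `vllm serve <repo-id> --tensor-parallel-size 4`.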
This answer comes from the article "Step3: An Open-Source Large Model for Efficient Multimodal Content Generation".