Step3 uses a three-layer optimization scheme to meet real-time requirements:
- Architecture level: the MoE model activates only about 12% of its parameters (38 billion of 321 billion), cutting single-inference latency by roughly 40%
- Deployment level: the vLLM engine is recommended; its continuous batching can increase throughput by 3-5x
- Parameter level: setting `max_new_tokens=512` keeps response time on an A800 GPU within 500 ms
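The activation ratio behind the architecture-level claim is simple arithmetic; a minimal sketch (using the parameter counts quoted above):

```python
# MoE activation ratio: only a fraction of total parameters run per token.
total_params_b = 321   # total parameters, in billions (from the text)
active_params_b = 38   # parameters activated per inference, in billions

ratio = active_params_b / total_params_b
print(f"activated fraction: {ratio:.1%}")  # roughly 12%
```

Because compute per token scales with activated (not total) parameters, this is what allows the latency reduction at the architecture level.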
Key configuration tip: when starting the vLLM service, add the `--tensor-parallel-size=4` parameter to take full advantage of multi-GPU parallel computing; measured QPS (queries per second) reaches 120+.
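Putting the deployment- and parameter-level settings together, a minimal launch sketch might look as follows. The model identifier is illustrative (substitute the actual Step3 checkpoint path), and the text's `max_new_tokens=512` maps to the `max_tokens` field of vLLM's OpenAI-compatible API:

```shell
# Start a vLLM OpenAI-compatible server sharded across 4 GPUs.
# Model name below is a placeholder, not a confirmed repository id.
vllm serve stepfun-ai/step3 \
    --tensor-parallel-size 4

# Example request capping generation length at 512 tokens:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "stepfun-ai/step3", "prompt": "Hello", "max_tokens": 512}'
```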
This answer comes from the article "Step3: Efficient generation of open source big models for multimodal content".