A Practical Guide to Local Deployment
Deploying Qwen3-30B-A3B locally requires choosing a serving solution that matches your hardware:
- High-performance GPU serving: the recommended frameworks are vLLM (>=0.8.4) and SGLang (>=0.4.6), started with the following commands respectively:

  ```
  # vLLM
  vllm serve Qwen/Qwen3-30B-A3B --enable-reasoning
  # SGLang
  python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B
  ```
- Lightweight deployment: Ollama offers a one-command start:

  ```
  ollama run qwen3:30b-a3b
  ```

  or use a quantized build with llama.cpp.
- Developer debugging: load the model directly through the transformers library; set device_map="auto" to distribute it automatically across multiple GPUs.
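For the developer-debugging path, a minimal sketch of loading via transformers looks like the following; the model ID matches the commands above, while the torch_dtype setting and helper name are illustrative assumptions, not settings mandated by the article:

```python
# Sketch: loading Qwen3-30B-A3B with transformers and automatic device placement.
MODEL_ID = "Qwen/Qwen3-30B-A3B"

# device_map="auto" lets the accelerate backend shard the model across all
# visible GPUs; torch_dtype="auto" picks the checkpoint's native precision.
LOAD_KWARGS = {"torch_dtype": "auto", "device_map": "auto"}

def load_model():
    """Load tokenizer and model; expects roughly 60 GB of GPU memory at FP16."""
    # Imported inside the function so the sketch can be read (and the constants
    # inspected) without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **LOAD_KWARGS)
    return tokenizer, model
```

Calling `load_model()` downloads the weights on first use, so it is best run on a machine that already meets the memory estimate below.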
Key configuration points:
- Memory estimation: FP16 precision requires roughly 60 GB of GPU memory; professional-grade cards such as the A100 or A40 are recommended.
- API compatibility: the deployed server exposes OpenAI-format API endpoints for easy integration with existing systems.
- Thinking-mode control: append the /think or /no_think directive to a request to switch modes dynamically.
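The configuration points above can be illustrated with a short sketch: the memory figure is plain arithmetic over the parameter count, and an OpenAI-format request with the /no_think soft switch is just a JSON payload. The prompt text here is invented for illustration, and activation/KV-cache overhead is ignored in the estimate:

```python
import json

# Memory estimate: ~30B parameters at FP16 (2 bytes each) => ~60 GB of VRAM,
# matching the A100/A40 recommendation.
PARAMS = 30e9
fp16_gb = PARAMS * 2 / 1e9  # 60.0

# OpenAI-format chat request body as it would be sent to a local endpoint.
payload = {
    "model": "Qwen/Qwen3-30B-A3B",
    "messages": [
        # Appending /no_think disables the thinking mode for this request;
        # /think would enable it.
        {"role": "user", "content": "Summarize MoE routing in one line. /no_think"}
    ],
}
body = json.dumps(payload)
```

Because the endpoint speaks the OpenAI format, this same payload works unchanged against vLLM, SGLang, or any other compatible server.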
For resource-constrained environments, prefer the smaller dense models such as the 4B or 8B variants, which run on consumer-grade graphics cards with a 32K context window and quantization.
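The same back-of-the-envelope arithmetic shows why the small dense models fit consumer GPUs. The 4-bit figure assumes roughly 0.5 bytes per weight and, like the estimate above, ignores activation and KV-cache overhead:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    return n_params * bytes_per_param / 1e9

# An 8B dense model: ~16 GB at FP16, ~4 GB with 4-bit quantization,
# which is within reach of common consumer cards.
fp16 = weight_memory_gb(8e9, 2.0)  # 16.0
int4 = weight_memory_gb(8e9, 0.5)  # 4.0
```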
This answer is based on the article "Qwen3 Released: A New Generation of Large Language Models for Thinking Deeply and Responding Fast".