The following steps are required to deploy Qwen3-235B-A22B-Thinking-2507:
- Environment Preparation: Hardware requirements are about 88 GB of GPU memory (VRAM) for the BF16 version, or about 30 GB for the FP8 version. Software requirements are Python 3.8+, PyTorch with CUDA support, and Hugging Face's transformers library (version ≥ 4.51.0).
- Model Download: Run `huggingface-cli download Qwen/Qwen3-235B-A22B-Thinking-2507` to fetch the model files (about 437.91 GB for the BF16 version, 220.20 GB for the FP8 version). A scripted Python alternative is sketched after this list.
- Model Loading: Load the model with transformers via `AutoModelForCausalLM.from_pretrained`, passing `torch_dtype="auto"` and `device_map="auto"` so that precision and device placement are handled automatically (see the example after this list).
- Optimized Configuration: For local runs, inference performance can be improved by reducing the context length (e.g., to 32768 tokens) or by serving the model with the sglang or vLLM frameworks (a vLLM sketch follows this list).
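If you prefer to script the download step rather than use the CLI, the `huggingface_hub` library exposes the same functionality in Python. A minimal sketch; the `local_dir` path is a placeholder to adjust for your storage layout:

```python
from huggingface_hub import snapshot_download

# Download all files from the model repository (~438 GB for BF16).
# local_dir is a placeholder; point it at a volume with enough free space.
snapshot_download(
    repo_id="Qwen/Qwen3-235B-A22B-Thinking-2507",
    local_dir="/models/Qwen3-235B-A22B-Thinking-2507",
)
```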
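The loading step, written out as a runnable sketch using the standard transformers chat workflow (the prompt and `max_new_tokens` value are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# torch_dtype="auto" selects the dtype stored in the checkpoint;
# device_map="auto" shards the weights across all visible GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the Monty Hall problem."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```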
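For the optimization step, here is a sketch of serving through vLLM's Python API with the reduced 32768-token context window mentioned above. The `tensor_parallel_size` value is an assumption for an 8-GPU node; adjust it to your hardware.

```python
from vllm import LLM, SamplingParams

# max_model_len caps the context length at 32768 tokens, as suggested above.
# tensor_parallel_size=8 assumes an 8-GPU node; change it to match your setup.
llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    tensor_parallel_size=8,
    max_model_len=32768,
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)
outputs = llm.generate(["Explain the Monty Hall problem."], params)
print(outputs[0].outputs[0].text)
```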
For tool-calling functionality, you also need to configure Qwen-Agent to define the tool interfaces; a minimal configuration sketch follows.
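A minimal Qwen-Agent sketch, assuming the model is already served behind a local OpenAI-compatible endpoint (the URL and the choice of the built-in `code_interpreter` tool are illustrative assumptions):

```python
from qwen_agent.agents import Assistant

# Assumed: a local OpenAI-compatible server (e.g., vLLM) at this URL.
llm_cfg = {
    "model": "Qwen3-235B-A22B-Thinking-2507",
    "model_server": "http://localhost:8000/v1",
    "api_key": "EMPTY",
}

# function_list declares the tool interfaces the agent may call;
# "code_interpreter" is one of Qwen-Agent's built-in tools.
bot = Assistant(llm=llm_cfg, function_list=["code_interpreter"])

messages = [{"role": "user", "content": "Plot y = x^2 for x in [0, 10]."}]
for responses in bot.run(messages=messages):
    pass  # bot.run streams incremental response lists
print(responses[-1]["content"])
```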
This answer is based on the article "Qwen3-235B-A22B-Thinking-2507: A Large-Scale Language Model for Complex Reasoning".