The following steps are required to deploy Qwen3-235B-A22B-Thinking-2507:
- Environment Preparation: Hardware requirements are about 88 GB of GPU memory (VRAM) for the BF16 version, or about 30 GB for the FP8 version. Software requirements are Python 3.8+, PyTorch with CUDA support, and Hugging Face's transformers library (version ≥ 4.51.0).
- Model Download: Run `huggingface-cli download Qwen/Qwen3-235B-A22B-Thinking-2507` to fetch the model files (about 437.91 GB for the BF16 version, 220.20 GB for the FP8 version). A scripted Python alternative is sketched after this list.
- Model Loading: Load the model with transformers via `AutoModelForCausalLM.from_pretrained`, passing `torch_dtype="auto"` and `device_map="auto"` so that precision and device placement are handled automatically (see the example after this list).
- Optimized Configuration: For local runs, inference performance can be improved by reducing the context length (e.g., to 32768 tokens) or by serving the model with the sglang or vLLM frameworks (a vLLM sketch follows this list).
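If you prefer to script the download step rather than use the CLI, the `huggingface_hub` library exposes the same functionality in Python. A minimal sketch; the `local_dir` path is a placeholder to adjust for your storage layout:

```python
from huggingface_hub import snapshot_download

# Download all files from the model repository (~438 GB for BF16).
# local_dir is a placeholder; point it at a volume with enough free space.
snapshot_download(
    repo_id="Qwen/Qwen3-235B-A22B-Thinking-2507",
    local_dir="/models/Qwen3-235B-A22B-Thinking-2507",
)
```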
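The loading step, written out as a runnable sketch using the standard transformers chat workflow (the prompt and `max_new_tokens` value are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# torch_dtype="auto" selects the dtype stored in the checkpoint;
# device_map="auto" shards the weights across all visible GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the Monty Hall problem."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```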
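For the optimization step, here is a sketch of serving through vLLM's Python API with the reduced 32768-token context window mentioned above. The `tensor_parallel_size` value is an assumption for an 8-GPU node; adjust it to your hardware.

```python
from vllm import LLM, SamplingParams

# max_model_len caps the context length at 32768 tokens, as suggested above.
# tensor_parallel_size=8 assumes an 8-GPU node; change it to match your setup.
llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",
    tensor_parallel_size=8,
    max_model_len=32768,
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)
outputs = llm.generate(["Explain the Monty Hall problem."], params)
print(outputs[0].outputs[0].text)
```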
For tool-calling functionality, you also need to configure Qwen-Agent to define the tool interfaces; a minimal configuration sketch follows.
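A minimal Qwen-Agent sketch, assuming the model is already served behind a local OpenAI-compatible endpoint (the URL and the choice of the built-in `code_interpreter` tool are illustrative assumptions):

```python
from qwen_agent.agents import Assistant

# Assumed: a local OpenAI-compatible server (e.g., vLLM) at this URL.
llm_cfg = {
    "model": "Qwen3-235B-A22B-Thinking-2507",
    "model_server": "http://localhost:8000/v1",
    "api_key": "EMPTY",
}

# function_list declares the tool interfaces the agent may call;
# "code_interpreter" is one of Qwen-Agent's built-in tools.
bot = Assistant(llm=llm_cfg, function_list=["code_interpreter"])

messages = [{"role": "user", "content": "Plot y = x^2 for x in [0, 10]."}]
for responses in bot.run(messages=messages):
    pass  # bot.run streams incremental response lists
print(responses[-1]["content"])
```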
This answer is based on the article "Qwen3-235B-A22B-Thinking-2507: A Large-Scale Language Model for Complex Reasoning".