Practical solutions for out-of-memory problems
Insufficient memory is a common challenge when running a large language model such as Qwen3-235B-A22B-Thinking-2507 locally. The following approaches are effective:
- Use the FP8 quantized version: the model is also released in FP8 (~220.20 GB), which cuts memory requirements by nearly 50% compared with the BF16 version (437.91 GB)
- Reduce the context length: the default 256K-token context consumes a large amount of memory for the KV cache; lowering it to 32,768 tokens significantly reduces the memory footprint
- Use an efficient inference framework: vLLM (≥0.8.5) or SGLang (≥0.4.6.post1) is recommended; both optimize memory management and inference efficiency (a vLLM example follows the recommended command below)
- Multi-GPU parallelism: distribute the model across multiple GPUs with the tensor-parallel-size parameter (--tp in SGLang, --tensor-parallel-size in vLLM)
- CPU offloading: frameworks such as llama.cpp can offload part of the model weights and computation to system RAM (a llama.cpp sketch is shown at the end of this section)
In practice, it is recommended to first try the following command to reduce memory requirements:
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8 --context-length 32768
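If vLLM is used instead, a roughly equivalent invocation combines a reduced context window with 8-way tensor parallelism; pointing it at the FP8 checkpoint lowers memory further. This is a sketch only: the FP8 repository name Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 is assumed here and should be verified against the official model card before use:

vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 --tensor-parallel-size 8 --max-model-len 32768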
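For CPU offloading with llama.cpp, only some layers are kept on the GPU and the rest run from system RAM. The following is a sketch: the GGUF file name is a placeholder for whatever quantized conversion is actually used, and the --n-gpu-layers value should be tuned to the available VRAM:

./llama-server -m Qwen3-235B-A22B-Thinking-2507-Q4_K_M.gguf --n-gpu-layers 40 --ctx-size 32768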
This answer comes from the article "Qwen3-235B-A22B-Thinking-2507: A large-scale language model to support complex reasoning".