
How to solve the problem of out of memory when running Qwen3-235B-A22B-Thinking-2507 model locally?

2025-08-20

Practical solutions for out-of-memory problems

Running out of memory is a common challenge when hosting a large language model like Qwen3-235B-A22B-Thinking-2507 locally. Several effective approaches are listed below:

  • FP8 quantized version: the model is also distributed as an FP8 checkpoint (~220.20 GB), which cuts the weight size by nearly 50% compared with the BF16 version (437.91 GB), working out to roughly 30 GB of weights per GPU when sharded across 8 GPUs
  • Reduce the context length: the default 256K (262,144-token) context reserves a large KV cache; lowering it to 32,768 tokens significantly shrinks the memory footprint
  • Use an efficient inference framework: vLLM (≥0.8.5) or SGLang (≥0.4.6.post1) is recommended; both optimize memory management and inference efficiency
  • Multi-GPU parallelism: distribute the model across multiple GPUs via the tensor parallel size parameter (--tensor-parallel-size in vLLM, --tp in SGLang), as shown in the example below
  • CPU offloading: frameworks such as llama.cpp can keep part of the model in system memory and place only some layers on the GPU; see the llama.cpp example at the end of this answer
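
As a concrete starting point for the multi-GPU and reduced-context suggestions above, a vLLM launch might look like the following. This is a sketch, not a verified command for your setup: it assumes the FP8 weights are published under the Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 repository name and that 8 GPUs are available.

# assumes the FP8 checkpoint name and an 8-GPU machine; adjust --tensor-parallel-size to your hardware
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 --tensor-parallel-size 8 --max-model-len 32768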

With SGLang, the recommended first attempt is the following command, which applies the same tensor parallelism and context-length reduction:
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8 --context-length 32768
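
For machines without enough GPUs, CPU offloading via llama.cpp is a fallback. The sketch below makes two assumptions: the GGUF filename is a hypothetical placeholder for whichever quantized build you download, and -ngl (--n-gpu-layers) controls how many layers stay on the GPU while the rest are held in system RAM.

# hypothetical GGUF filename; tune -ngl to fit your VRAM, remaining layers run from system RAM
./llama-server -m Qwen3-235B-A22B-Thinking-2507-Q4_K_M.gguf -ngl 40 -c 32768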
