Qwen3-235B-A22B-Thinking-2507 is a large-scale language model developed by the Alibaba Cloud Qwen team, released on July 25, 2025 and hosted on the Hugging Face platform. It focuses on complex reasoning tasks, supports context lengths of up to 256K (262,144) tokens, and is suited to logical reasoning, math, science, programming, and academic work. The model uses a Mixture of Experts (MoE) architecture with 235 billion total parameters, of which 22 billion are activated per inference, balancing performance and efficiency. It performs strongly among open-source reasoning models and is particularly well suited to applications that require deep thinking and long-context processing. It can be deployed with a variety of inference frameworks such as transformers, sglang, and vLLM, and also supports running locally.

Feature List

  • Supports ultra-long context understanding of 256K tokens for processing complex documents or multi-turn dialogue.
  • Provides strong logical reasoning for math, science, and academic problems.
  • Specializes in programming tasks, supporting code generation and debugging.
  • Integrates tool invocation, simplifying interaction with external tools through Qwen-Agent.
  • Supports more than 100 languages, suitable for multilingual instruction following and translation.
  • Offers a quantized FP8 version to reduce hardware requirements and optimize inference performance.
  • Compatible with a variety of inference frameworks such as transformers, sglang, vLLM, and llama.cpp.

Usage Guide

Installation and Deployment

To use Qwen3-235B-A22B-Thinking-2507, you need a high-performance computing environment because the model files are large (about 437.91 GB for the BF16 version and 220.20 GB for the FP8 version). The detailed installation steps are as follows:

  1. Environment preparation:
    • Make sure the hardware meets the requirements: 88GB of GPU memory is recommended for the BF16 version, and about 30GB for the FP8 version.
    • Install Python 3.8+ and PyTorch; a GPU environment with CUDA support is recommended.
    • Install the Hugging Face transformers library (version ≥ 4.51.0) to avoid compatibility issues:
      pip install "transformers>=4.51.0"

    • Optionally install sglang (≥ 0.4.6.post1) or vLLM (≥ 0.8.5) for efficient inference:
      pip install "sglang>=0.4.6.post1" "vllm>=0.8.5"
      
  2. Download the model:
    • Download the model from the Hugging Face repository:
      huggingface-cli download Qwen/Qwen3-235B-A22B-Thinking-2507
      
    • For the FP8 version, download Qwen3-235B-A22B-Thinking-2507-FP8:
      huggingface-cli download Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
      
  3. Run locally:
    • Use transformers to load the model (a complete generation sketch follows this list):
      from transformers import AutoModelForCausalLM, AutoTokenizer
      model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
      
    • Alternatively, launch an sglang server; to avoid running out of memory, the context length can be reduced (e.g., to 32,768 tokens):
      python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8 --context-length 32768 --reasoning-parser deepseek-r1
      
  4. Tool call configuration:
    • Simplify tool calls with Qwen-Agent:
      from qwen_agent.agents import Assistant

      # Use the DashScope-hosted model; Qwen-Agent handles the tool-call protocol.
      llm_cfg = {
          'model': 'qwen3-235b-a22b-thinking-2507',
          'model_type': 'qwen_dashscope'
      }
      # Register an MCP time server as an external tool.
      tools = [{'mcpServers': {'time': {'command': 'uvx', 'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']}}}]
      bot = Assistant(llm=llm_cfg, function_list=tools)
      messages = [{'role': 'user', 'content': 'Get the current time'}]
      for responses in bot.run(messages=messages):
          print(responses)
      
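Once the model and tokenizer from step 3 are loaded, the following is a minimal generation sketch; the prompt and max_new_tokens are illustrative choices rather than values from the model card:

    # Build a chat prompt, generate, and separate the reasoning from the final answer.
    messages = [{"role": "user", "content": "Prove Fermat's Little Theorem."}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=4096)
    result = tokenizer.decode(output_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
    # Thinking-mode output includes a </think> marker; everything after it is the final answer.
    thinking, _, answer = result.partition("</think>")
    print(answer.strip())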

Main Functions

  • Complex reasoning: Thinking mode is enabled by default and the output contains <think> tags, which suits solving mathematical or logical problems. For example, enter "Prove Fermat's Little Theorem" and the model generates a step-by-step reasoning process.
  • Long context processing: Supports 256K tokens and is suitable for analyzing long documents. After inputting a long text, the model can extract key information or answer related questions.
  • Programming support: Enter a code snippet or a request such as "Write a Python sorting algorithm", and the model generates complete code and explains the logic (see the sketch after this list for sending such a request to a locally deployed server).
  • Tool calls: With Qwen-Agent, the model can invoke external tools, such as getting the time or executing web requests, simplifying complex tasks.
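
If you launched a local sglang or vLLM server as in step 3, both expose an OpenAI-compatible API. A minimal sketch for sending a programming request to it, assuming the server listens at http://localhost:30000/v1 (sglang's default port; vLLM defaults to 8000) and ignores the API key locally:

    from openai import OpenAI

    # Assumed local endpoint; adjust the host, port, and model name to match your server.
    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
    response = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Thinking-2507",
        messages=[{"role": "user", "content": "Write a Python sorting algorithm"}],
        max_tokens=2048,
    )
    print(response.choices[0].message.content)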

Caveats

  • In reasoning mode, a context length of at least 131,072 tokens is recommended to ensure performance.
  • Avoid greedy decoding, which may result in repetitive output (see the sampling sketch after this list).
  • For local runs, Ollama or LM Studio can be used, but the context length needs to be adjusted to avoid looping problems.
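
Reusing the model and inputs from the generation sketch above, a minimal sketch of non-greedy sampling; temperature=0.6, top_p=0.95, and top_k=20 follow the settings suggested in the model card for thinking mode, but verify them against the current card:

    # Sample instead of decoding greedily to avoid repetitive output.
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_new_tokens=4096,
    )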

Application Scenarios

  1. Academic research
    Researchers can use the model to analyze long papers, extract key arguments, or validate mathematical formulas. Its 256K context length supports processing entire documents and is suitable for literature reviews or cross-chapter analysis.
  2. Programming development
    Developers can use the model to generate code, debug programs, or optimize algorithms. For example, enter a complex algorithm requirement and the model provides the code and explains the implementation steps.
  3. Multilingual translation
    Enterprises can use the model for multilingual document translation or instruction processing; with support for more than 100 languages, it is suitable for cross-border communication or localization tasks.
  4. Educational support
    Students and teachers can use the model to answer math and science questions or to generate teaching materials. Its reasoning ability helps explain complex concepts.

FAQ

  1. What inference frameworks does the model support?
    It supports transformers, sglang, vLLM, Ollama, LM Studio, and llama.cpp. Using the latest versions is recommended to ensure compatibility.
  2. How do I deal with out-of-memory problems?
    Reduce the context length to 32,768 tokens, or use the FP8 version to lower memory requirements. You can also spread the model across multiple GPUs via the tensor-parallel-size parameter (see the example command after this list).
  3. How do I enable the tool call feature?
    Configure tools with Qwen-Agent by defining MCP configuration files or built-in tools; the model can then call external functions automatically.
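
As an illustration of the options mentioned in question 2, a hedged vLLM launch command might look like the following; the exact flags depend on your vLLM version, so treat this as a sketch rather than a verified recipe:

    vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 \
      --tensor-parallel-size 8 \
      --max-model-len 32768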