Scale effects of data engineering innovations
Qwen3 was pre-trained on 36 trillion tokens, twice as much as its predecessor Qwen2.5, covering high-quality content such as STEM, programming, and academic papers. According to the technical report, data construction proceeds in three key phases: general pre-training at 4K context (30 trillion tokens), knowledge-intensive data optimization (5 trillion tokens), and long-context extension training at 32K-128K. Beyond generic web pages, the data sources include parsed PDF documents (92.3% parsing accuracy) and synthetic data generated by the Qwen2.5 series of models.
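To make the staged schedule easier to see at a glance, here is a minimal Python sketch of the three-phase data plan as summarized above. The stage names, field names, and the 131,072-token maximum context are assumptions for readability, and the long-context token budget is left as `None` because the summary does not give a figure for it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PretrainStage:
    name: str
    tokens_trillions: Optional[float]  # None where the summary gives no figure
    max_context: int                   # maximum sequence length in tokens
    focus: str

# Token budgets and context lengths are the ones quoted above; everything else
# (names, the 128K = 131,072 interpretation) is illustrative, not from the report.
QWEN3_PRETRAIN_SCHEDULE = [
    PretrainStage("general",             30.0, 4_096,   "web pages, parsed PDFs, synthetic data"),
    PretrainStage("knowledge-intensive",  5.0, 4_096,   "STEM, code, and reasoning-heavy data"),
    PretrainStage("long-context",        None, 131_072, "32K-128K context extension"),
]

if __name__ == "__main__":
    for stage in QWEN3_PRETRAIN_SCHEDULE:
        budget = f"{stage.tokens_trillions}T" if stage.tokens_trillions else "n/a"
        print(f"{stage.name:>20}: {budget:>6} tokens @ {stage.max_context:,} ctx - {stage.focus}")
```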
Quality improvement measures include:
- Optimizing multimodal text extraction (e.g., from PDFs) with the Qwen2.5-VL model
- Generating millions of mathematical-reasoning examples with Qwen2.5-Math (a rough sketch of this idea follows the list)
- Improving the diversity of code data with Qwen2.5-Coder
- Applying a five-tier content-safety filtering mechanism
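The sketch below illustrates how synthetic math-reasoning data could be produced with an open Qwen2.5-Math checkpoint and then lightly filtered. The prompt, the toy quality filter, and the model ID `Qwen/Qwen2.5-Math-7B-Instruct` are assumptions for illustration; the actual pipeline in the Qwen3 report is not described at this level of detail.

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-Math-7B-Instruct"  # assumed generator checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

def generate_solution(problem: str, max_new_tokens: int = 1024) -> str:
    """Generate a step-by-step solution for one seed problem."""
    messages = [
        {"role": "system", "content": "Please reason step by step and put the final answer in \\boxed{}."},
        {"role": "user", "content": problem},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def keep_example(solution: str) -> bool:
    """Toy quality filter: keep only solutions that contain a boxed final answer."""
    return re.search(r"\\boxed\{.+?\}", solution) is not None

seed_problems = ["What is the sum of the first 100 positive integers?"]
dataset = []
for problem in seed_problems:
    solution = generate_solution(problem)
    if keep_example(solution):
        dataset.append({"problem": problem, "solution": solution})
```

In practice, a pipeline like this would also deduplicate generations and verify answers against a ground truth; the snippet only shows the generate-then-filter shape of the approach.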
Benchmark results show that the Qwen3-32B base model outperforms Qwen2.5-72B on evaluations such as MATH and HumanEval, underscoring the decisive impact of data quality on model capability. This data advantage lets even small models (e.g., 4B parameters) handle tasks that traditionally required models at the 70B-parameter level.
This answer comes from the article "Qwen3 Released: A New Generation of Large Language Models for Deep Thinking and Fast Responses".