What are the specific breakthroughs in multilingual support in Qwen3? What are the features of its training data strategy?

2025-08-24

Technical realization of multilingual capabilities

Qwen3 covers 119 languages and dialects, with notable advances in three areas:

  • Broad language coverage: Includes major language families such as Indo-European (67), Sino-Tibetan (3), and Austronesian (12), and extends to low-resource languages such as Luxembourgish and Assamese.
  • Dialect-level support: Arabic is covered in 7 dialect variants, including Najdi, Egyptian, and Moroccan.
  • Mixed-script input: Handles text that mixes CJK (Chinese, Japanese, Korean) characters with Latin letters.

Three innovations in training data strategies:

  1. Doubled data volume: The pre-training corpus reaches 36 trillion tokens (2x that of Qwen2.5), with the share of non-English data raised to 45%.
  2. Multimodal text extraction: Qwen2.5-VL extracts text from PDFs and other documents, and the text is added to the training set after quality filtering.
  3. Synthetic data augmentation: Qwen2.5-Math and Qwen2.5-Coder generate structured data such as code solutions and mathematical derivations.
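
As a quick sanity check on the figures in the list above, the following Python sketch works out what they imply in absolute token counts. The inputs are the numbers quoted in this article. The derived values are simple arithmetic, not officially published counts, and the variable names are chosen only for this example.

```python
# Figures quoted in the list above; integer arithmetic avoids float rounding.
TOTAL_TOKENS = 36 * 10**12   # 36 trillion pre-training tokens
NON_ENGLISH_PERCENT = 45     # stated non-English share, in percent
GROWTH_FACTOR = 2            # "2x Qwen2.5"

# Derived quantities (implied by the stated figures, not official numbers).
non_english_tokens = TOTAL_TOKENS * NON_ENGLISH_PERCENT // 100
remaining_tokens = TOTAL_TOKENS - non_english_tokens
implied_qwen25_tokens = TOTAL_TOKENS // GROWTH_FACTOR


def in_trillions(n: int) -> str:
    """Format a token count in trillions with one decimal place."""
    return f"{n / 10**12:.1f}T"


if __name__ == "__main__":
    print("non-English tokens:     ", in_trillions(non_english_tokens))
    print("remaining tokens:       ", in_trillions(remaining_tokens))
    print("implied Qwen2.5 corpus: ", in_trillions(implied_qwen25_tokens))
```

Taken at face value, a 45% share of 36 trillion tokens corresponds to about 16.2 trillion non-English tokens, and the "2x" claim implies a Qwen2.5 corpus of about 18 trillion tokens.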

Pre-training proceeds in three stages. The S2 stage is dedicated to raising the proportion of knowledge-intensive data, and the S3 stage reinforces contextual understanding in low-resource languages through long-text fine-tuning. Together, these stages bring Qwen3 to roughly GPT-3.5 level on low-resource-language tasks.
