Multidimensional diversity of the dataset
The Chinese DeepSeek-R1 distillation dataset achieves its diversity through deliberate data composition, along three dimensions. First, type diversity: it contains rigorous mathematical computation data, complex logical reasoning data, and a broad range of general knowledge data. Second, source diversity: the data is drawn from multiple kinds of scenarios, such as professional Q&A on Zhihu and everyday posts on Xiaohongshu. Third, difficulty diversity: the data ranges from basic computation to advanced reasoning. This multidimensional design lets the dataset support:
- Basic text classification tasks
- Complex question answering systems
- Mathematical computation skill assessment
- Multi-turn dialogue modeling
Depending on their needs, researchers can select specific types of data through the dataset's categorization and filtering functions, or combine several types of data to suit the task at hand.
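As a rough illustration of selecting by type and source, here is a minimal Python sketch. The records, the field names (`category`, `source`) and the label values are hypothetical and chosen only for this example; the dataset's actual schema may differ, so check its documentation before adapting the code.

```python
from collections import Counter

# Hypothetical records: the field names and label values below are
# assumptions for illustration, not the dataset's real schema.
records = [
    {"id": 1, "category": "math", "source": "zhihu"},
    {"id": 2, "category": "logic", "source": "zhihu"},
    {"id": 3, "category": "general", "source": "xiaohongshu"},
    {"id": 4, "category": "math", "source": "xiaohongshu"},
    {"id": 5, "category": "general", "source": "zhihu"},
]


def select(rows, categories=None, sources=None):
    """Keep rows whose category and source are in the allowed sets.

    Passing None for a filter means "no restriction" on that field.
    """
    return [
        row
        for row in rows
        if (categories is None or row["category"] in categories)
        and (sources is None or row["source"] in sources)
    ]


# Select a single type of data.
math_only = select(records, categories={"math"})

# Combine several types of data into one subset.
math_and_logic = select(records, categories={"math", "logic"})

# Count how many records of each type ended up in the combined subset.
category_counts = Counter(row["category"] for row in math_and_logic)
print(category_counts)
```

The same pattern extends to a difficulty field if the data exposes one: add another optional set argument and another condition to the comprehension.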
This answer comes from the article "Chinese based full-blooded DeepSeek-R1 distillation dataset, supports Chinese R1 distillation SFT dataset".