Typical pain points
When a pre-trained model such as BERT is applied directly to multi-source heterogeneous data, large differences in text length across sources and noise in the raw text degrade classification performance.
Optimization solutions
- Dynamic truncation (see the tokenizer sketch after this list):
  - For math-class data, set `max_length=256`
  - For Xiaohongshu (Little Red Book) short texts, enable `truncation='only_first'`
- Noise filtering (see the preprocessing sketch after this list):
  - Weight samples using the category field that ships with the dataset
  - Clean up digit noise via `texthero.preprocessing.remove_digits`
- Enhanced representation (see the model sketch after this list):
  - Add a `DomainAdaptation` layer after the last layer of BERT
  - For long texts, use max-pooling over token embeddings instead of the `[CLS]` vector
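A minimal sketch of the dynamic-truncation idea, assuming the dataset carries a `category` field that identifies the source; the checkpoint name, the routing logic, and the short-text `max_length=64` are illustrative placeholders, not values from the article:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute whichever BERT variant you fine-tune.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def tokenize_by_source(example):
    """Route each sample to source-specific truncation settings."""
    if example["category"] == "math":
        # Math problems tend to be long; keep more context.
        return tokenizer(example["text"], max_length=256,
                         truncation=True, padding="max_length")
    # Short social-media posts fit in far fewer tokens. Note that
    # truncation='only_first' only differs from the default when the
    # input is a sentence pair; for a single text it truncates normally.
    return tokenizer(example["text"], max_length=64,
                     truncation="only_first", padding="max_length")

# Typical use with a Hugging Face dataset:
# ds = ds.map(tokenize_by_source)
```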
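For the noise-filtering step, a sketch that combines `texthero`'s digit removal with frequency-based sample weights; the column names (`text`, `category`) are assumptions:

```python
import pandas as pd
from texthero import preprocessing as hpp

df = pd.DataFrame({
    "text": ["Solve 2x + 3 = 11", "item no. 48213 great buy!!!"],
    "category": ["math", "general"],
})

# Remove standalone digit blocks that act as noise in social-media text.
# NOTE: this would also strip numbers from math problems, so here it is
# applied only to the non-math subset.
mask = df["category"] != "math"
df.loc[mask, "text"] = hpp.remove_digits(df.loc[mask, "text"])

# Weight samples inversely to category frequency, using the dataset's
# own category field, so rare classes are not drowned out.
freq = df["category"].value_counts(normalize=True)
df["sample_weight"] = df["category"].map(lambda c: 1.0 / freq[c])
print(df[["category", "sample_weight"]])
```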
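For the representation changes, one possible reading of the `DomainAdaptation` layer is a small projection block on top of the encoder; the sketch below combines it with masked max-pooling in place of the `[CLS]` vector (the layer design and sizes are illustrative, not the article's exact architecture):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertWithDomainAdaptation(nn.Module):
    """Sketch: BERT encoder + adaptation layer, pooled by masked max-pooling."""

    def __init__(self, model_name="bert-base-chinese", num_labels=3):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # One plausible "DomainAdaptation" layer: a projection block that
        # re-maps the shared representation after the last BERT layer.
        self.domain_adapt = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.LayerNorm(hidden)
        )
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h = self.domain_adapt(out.last_hidden_state)   # (batch, seq, hidden)
        # Max-pool over tokens instead of taking the [CLS] vector,
        # masking padding positions so they cannot dominate the max.
        pad = attention_mask.unsqueeze(-1) == 0        # (batch, seq, 1)
        h = h.masked_fill(pad, float("-inf"))
        return self.classifier(h.max(dim=1).values)
```

Max-pooling tends to be more robust than `[CLS]` on long inputs because salient tokens anywhere in the sequence can contribute to the pooled vector.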
Practice Recommendations
When splitting into training/validation/test sets, it is recommended to use `datasets.DatasetDict`; keep an 8:1:1 ratio and make sure the validation set covers all data categories (math/logic/general). A split sketch follows below.
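A sketch of that split, assuming the category column is literally named `category`; `train_test_split`'s `stratify_by_column` requires a `ClassLabel` column, hence the `class_encode_column` call:

```python
from datasets import Dataset, DatasetDict

# Toy in-memory data; in practice this would come from load_dataset(...).
ds = Dataset.from_dict({
    "text": [f"sample {i}" for i in range(30)],
    "category": ["math", "logic", "general"] * 10,
})

# stratify_by_column needs a ClassLabel column, so encode it first.
ds = ds.class_encode_column("category")

# Chained 80/20 then 50/50 splits yield the 8:1:1 ratio; stratifying on
# category keeps math/logic/general represented in every split.
split = ds.train_test_split(test_size=0.2, stratify_by_column="category", seed=42)
held = split["test"].train_test_split(test_size=0.5, stratify_by_column="category", seed=42)
data = DatasetDict({
    "train": split["train"],
    "validation": held["train"],
    "test": held["test"],
})
print({name: len(part) for name, part in data.items()})
```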
This answer comes from the article "Chinese full-strength DeepSeek-R1 distillation dataset, supporting a Chinese R1 distillation SFT dataset".