CleanTool is a data preprocessing tool that accompanies the EduChat project, and its standard operating procedure is as follows:
- Input preparation: Save the raw dialogue data in JSON format. Each record should contain three fields: `instruction`, `input`, and `output`.
- Basic cleaning: Run `python clean_tool.py --input data.json --output cleaned_data.json --gpu True`. The tool will automatically:
  - Remove exact duplicate samples (based on MD5 hash)
  - Filter out low-quality data (via N-gram overlap and perplexity detection)
  - Standardize text formatting (unify full-width/half-width characters, etc.)
- Advanced options:
  - Domain filtering: add `--domain edu` to retain samples with high educational relevance
  - Length control: `--min_length 20` removes responses that are too short
  - Quality threshold: `--quality_threshold 0.7` adjusts the decision criterion (range 0-1)
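To make the basic cleaning steps more concrete, here is a minimal Python sketch of three of them: field validation, full-width/half-width normalization, and MD5-based exact deduplication, plus a minimum-length check. It is not CleanTool's actual code. The function names (`normalize_text`, `record_hash`, `clean_records`), the use of Unicode NFKC for normalization, hashing after normalization, and measuring length in characters of `output` are all assumptions made for illustration. The N-gram overlap and perplexity filters are omitted.

```python
import hashlib
import unicodedata

# The three fields each record is expected to carry (from the input format above).
REQUIRED_FIELDS = ("instruction", "input", "output")


def normalize_text(text: str) -> str:
    """Fold full-width characters to their half-width forms (NFKC) and trim whitespace.

    Assumption: NFKC stands in for whatever normalization the real tool applies.
    """
    return unicodedata.normalize("NFKC", text).strip()


def record_hash(record: dict) -> str:
    """MD5 digest over the three fields, used only to detect exact duplicates."""
    joined = "\x1f".join(record[field] for field in REQUIRED_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def clean_records(records: list[dict], min_length: int = 20) -> list[dict]:
    """Hypothetical helper: validate, normalize, length-filter, and deduplicate."""
    seen = set()
    cleaned = []
    for raw in records:
        # Skip records that lack any of the three string fields.
        if not all(isinstance(raw.get(field), str) for field in REQUIRED_FIELDS):
            continue
        record = {field: normalize_text(raw[field]) for field in REQUIRED_FIELDS}
        # Assumption: length is counted in characters of the output field.
        if len(record["output"]) < min_length:
            continue
        digest = record_hash(record)
        if digest in seen:
            continue  # exact duplicate of an earlier record
        seen.add(digest)
        cleaned.append(record)
    return cleaned
```

Because hashing happens after normalization in this sketch, two records that differ only in full-width versus half-width characters count as duplicates. Whether the real tool orders the steps this way is not stated in the source.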
Verification shows that cleaned data improves model training efficiency by 30% and reduces the error rate by 15% on tasks that require rigor, such as mathematical problem solving. For non-technical users, the project repository provides a template of preset cleaning rules that can be applied directly.
This answer comes from the article "EduChat: Open Source Education Dialogue Model".





























