Full Process Solution for Educational Data Cleansing
CleanTool offers a three-step data optimization method:
- Basic cleaning: Run the standard command to remove duplicates and low-quality data.
python clean_tool.py --input raw_data.json --output stage1.json --gpu True
- Domain enhancement: Use the --edu_keywords parameter to retain data with educational characteristics, such as text containing "pedagogical" or "cognitive".
python clean_tool.py --input stage1.json --output final_data.json --edu_keywords teaching,learning
- Quality assurance: Use the --metrics parameter to generate a data quality report covering metrics such as lexical density and thematic coherence.
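To make the three steps concrete, here is a minimal Python sketch of what each stage might do conceptually. It is not CleanTool's actual implementation: the function names, the `text` field, the stopword list, and the `min_tokens` threshold are all assumptions made for illustration, and lexical density is approximated as the share of tokens that are not in a small stopword list.

```python
import re

# Hypothetical stand-ins for CleanTool's internals; the real tool may differ.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and",
             "in", "it", "that", "this", "with"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def basic_clean(records, min_tokens=3):
    """Step 1: drop duplicates (after case/whitespace normalization)
    and records with fewer than min_tokens tokens."""
    seen, kept = set(), []
    for rec in records:
        key = " ".join(rec["text"].lower().split())
        if key in seen or len(tokenize(key)) < min_tokens:
            continue
        seen.add(key)
        kept.append(rec)
    return kept


def keep_edu(records, keywords):
    """Step 2: keep records that mention at least one keyword."""
    kws = [k.lower() for k in keywords]
    return [r for r in records if any(k in r["text"].lower() for k in kws)]


def lexical_density(text):
    """Step 3 metric: fraction of tokens that are not stopwords."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t not in STOPWORDS for t in tokens) / len(tokens)


def quality_report(records):
    """Step 3: summarize the cleaned data set."""
    densities = [lexical_density(r["text"]) for r in records]
    mean = sum(densities) / len(densities) if densities else 0.0
    return {"count": len(records), "mean_lexical_density": mean}


if __name__ == "__main__":
    raw = [
        {"text": "Teaching fractions works best with visual models."},
        {"text": "teaching  fractions works best with visual models."},
        {"text": "ok"},
        {"text": "The weather is nice today in town."},
    ]
    stage1 = basic_clean(raw)
    final = keep_edu(stage1, ["teaching", "learning"])
    print(quality_report(final))
```

The stages are kept as separate functions so that each one mirrors one command-line pass, with the output of one stage feeding the next in the same way stage1.json feeds the second command.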
Suggestions for special scenarios:
- Counseling data: add the --sentiment_filter parameter to preserve emotionally rich conversations.
- Multilingual data: separate languages with the --lang parameter (en/zh).
- Large-scale processing: set --batch_size 1024 to improve processing efficiency.
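The multilingual and large-scale suggestions can be illustrated with a short Python sketch. This is an assumption about what language separation and batching could look like, not a description of how CleanTool implements them: `split_by_language` and `batched` are hypothetical helpers, and the language check is a crude heuristic that treats any record containing a CJK ideograph as Chinese.

```python
import re
from itertools import islice

# Matches characters in the CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")


def split_by_language(records):
    """Heuristic split: a record with any CJK ideograph goes to 'zh',
    everything else goes to 'en'."""
    buckets = {"en": [], "zh": []}
    for rec in records:
        lang = "zh" if CJK.search(rec["text"]) else "en"
        buckets[lang].append(rec)
    return buckets


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    data = [{"text": "How do I solve this equation?"},
            {"text": "这道题怎么解？"}]
    for lang, recs in split_by_language(data).items():
        for batch in batched(recs, 1024):
            print(lang, len(batch))
```

Processing records in fixed-size batches keeps memory use bounded and lets a GPU-backed step work on many records per call, which is the usual motivation for a batch size setting.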
This answer comes from the article "EduChat: Open Source Education Dialogue Model".