
What is the exact procedure for cleaning data with CleanTool?

2025-08-21

CleanTool is the data preprocessing tool that accompanies the EduChat project. Its standard operating procedure is as follows:

Input preparation: Save the raw dialogue data in JSON format. Each record should contain three fields: instruction (the instruction), input (the input), and output (the output).
Basic cleaning: Run the command python clean_tool.py --input data.json --output cleaned_data.json --gpu True. The tool will automatically:
- Remove exact duplicate samples (based on MD5 hash)
- Filter out low-quality data (via N-gram overlap and perplexity detection)
- Standardize text formatting (unify full-width and half-width characters, etc.)
Advanced options:
- Domain filtering: adding the --domain edu parameter retains samples with high educational relevance
- Length control: --min_length 20 removes responses that are too short
- Quality threshold: --quality_threshold 0.7 adjusts the filtering criterion (range 0-1)
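
To make the basic cleaning steps more concrete, here is a minimal Python sketch of three of them: checking the required fields, removing exact duplicates by MD5 hash, and dropping short responses. It is not CleanTool's actual code. The function names, the use of NFKC normalization for full-width and half-width characters, and the reading of the minimum length as a character count on the output field are all assumptions made for illustration.

```python
import hashlib
import json
import unicodedata

# Field names taken from the input-preparation step above.
REQUIRED_FIELDS = ("instruction", "input", "output")


def normalize(text):
    # Assumption: NFKC folds full-width letters and digits into their
    # half-width equivalents; CleanTool's own rules may differ.
    return unicodedata.normalize("NFKC", text).strip()


def record_hash(record):
    # Hash a canonical JSON form so identical records share one MD5 digest.
    canonical = json.dumps(
        {field: record[field] for field in REQUIRED_FIELDS},
        ensure_ascii=False,
        sort_keys=True,
    )
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def clean(records, min_length=20):
    # Hypothetical helper: keeps well-formed, sufficiently long, unique records.
    seen = set()
    cleaned = []
    for record in records:
        # Skip records missing any of the three required string fields.
        if not all(isinstance(record.get(f), str) for f in REQUIRED_FIELDS):
            continue
        normalized = {f: normalize(record[f]) for f in REQUIRED_FIELDS}
        # Assumption: length is measured in characters of the output field.
        if len(normalized["output"]) < min_length:
            continue
        digest = record_hash(normalized)
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(normalized)
    return cleaned
```

The sketch leaves out the N-gram overlap and perplexity checks because the answer does not describe how CleanTool computes them.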

The cleaned data has reportedly been verified to improve model training efficiency by 30% and to reduce the error rate by 15% on tasks that require rigor, such as mathematical problem solving. For non-technical users, the project repository provides preset cleaning-rule templates that can be applied directly.

This answer comes from the article "EduChat: Open Source Education Dialogue Model".
