
What is the exact procedure for data cleansing using the CleanTool tool?

2025-08-21

CleanTool is the data preprocessing tool that accompanies the EduChat project. Its standard operating procedure is as follows:

  1. Input preparation: Save the raw dialog data in JSON format. Each record should contain three fields: instruction (the instruction), input (the input), and output (the output).
  2. Basic cleaning: Run the following command:
       python clean_tool.py --input data.json --output cleaned_data.json --gpu True
     The tool will then automatically:
    • Remove exact duplicate samples (based on MD5 hashes)
    • Filter out low-quality data (via N-gram overlap and perplexity checks)
    • Standardize text formatting (unify full-width and half-width characters, etc.)
  3. Advanced options:
    • Domain filtering: adding the --domain edu parameter retains samples with high educational relevance
    • Length control: --min_length 20 removes responses that are too short
    • Quality threshold: --quality_threshold 0.7 adjusts the quality cutoff (range 0-1)
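
To make the simpler steps above concrete, here is a minimal Python sketch of three of them: checking for the three required fields, unifying full-width and half-width characters, and dropping exact duplicates by MD5 hash, plus a minimum-length filter. It is not CleanTool's actual implementation. The function names, the use of Unicode NFKC normalization, measuring length in characters of the output field, and normalizing before hashing are all assumptions made for illustration. The N-gram overlap and perplexity checks are not covered.

```python
import hashlib
import json
import unicodedata

# The three fields each record is expected to contain (from step 1).
REQUIRED_FIELDS = ("instruction", "input", "output")


def normalize_text(text: str) -> str:
    """Fold full-width characters to half-width via NFKC and trim whitespace."""
    return unicodedata.normalize("NFKC", text).strip()


def record_hash(record: dict) -> str:
    """MD5 digest over the three fields, used only to spot exact duplicates."""
    joined = "\x1f".join(record[field] for field in REQUIRED_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def clean_records(records: list[dict], min_length: int = 20) -> list[dict]:
    """Hypothetical helper: validate, normalize, length-filter, deduplicate."""
    seen = set()
    cleaned = []
    for raw in records:
        # Skip records that lack any of the three required string fields.
        if not all(isinstance(raw.get(field), str) for field in REQUIRED_FIELDS):
            continue
        record = {field: normalize_text(raw[field]) for field in REQUIRED_FIELDS}
        # Assumption: length is counted in characters of the output field.
        if len(record["output"]) < min_length:
            continue
        # Hash after normalization so width variants count as duplicates.
        digest = record_hash(record)
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    sample = [
        {"instruction": "Add", "input": "１＋１", "output": "The answer to this sum is 2."},
        {"instruction": "Add", "input": "1+1", "output": "The answer to this sum is 2."},
        {"instruction": "Add", "input": "2+2", "output": "4"},
    ]
    print(json.dumps(clean_records(sample), ensure_ascii=False, indent=2))
```

In this sketch the first two sample records differ only in character width, so they collapse into a single record after normalization, and the third is dropped because its output is shorter than the minimum length.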

According to the project's reported results, training on the cleaned data improves model training efficiency by 30% and reduces the error rate by 15% on tasks that require rigor, such as mathematical problem solving. For non-technical users, the project repository provides a template of preset cleaning rules that can be applied directly.
