An automated workflow for preparing AI training data
Fine-tuning large models requires large volumes of structured technical documentation, but collecting it manually runs into scattered sources, inconsistent formats, and painful cleaning. DevDocs provides an end-to-end solution:
- Batch collection: crawl multiple technical documentation sites at the same time
- Standardized output: generate training-ready JSON directly
- Quality control: verify crawl integrity through log analysis
Implementation steps:
- Create a URL list file urls.txt (one document address per line)
- Run the parallel crawl command: ./scripts/batch_crawl.sh urls.txt 3 (the 3 indicates crawl depth)
- Check data quality with view_result.sh
- Use the JSON files in crawl_results directly for model training (see the conversion sketch after this list)
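The exact output schema depends on the DevDocs version in use. As a minimal sketch, assuming each JSON file in crawl_results holds a list of page records with url, title, and content fields (these field names are assumptions, not the documented schema), converting the crawl output into a JSONL fine-tuning file could look like this:

```python
import json
from pathlib import Path

# Assumed layout: crawl_results/*.json, each file holding a list of page records.
# The field names "url", "title", and "content" are assumptions; adjust them to
# match the actual DevDocs output before use.
RESULTS_DIR = Path("crawl_results")
OUTPUT_FILE = Path("training_data.jsonl")

def to_training_record(page: dict) -> dict:
    """Map one crawled page to a simple prompt/completion training record."""
    return {
        "prompt": f"Summarize the documentation page: {page.get('title') or page.get('url', '')}",
        "completion": page.get("content", ""),
    }

with OUTPUT_FILE.open("w", encoding="utf-8") as out:
    for json_file in sorted(RESULTS_DIR.glob("*.json")):
        pages = json.loads(json_file.read_text(encoding="utf-8"))
        # Some dumps may store a single object instead of a list.
        if isinstance(pages, dict):
            pages = [pages]
        for page in pages:
            if page.get("content"):  # skip pages with no extracted text
                out.write(json.dumps(to_training_record(page), ensure_ascii=False) + "\n")

print(f"Wrote training records to {OUTPUT_FILE}")
```

Each line of training_data.jsonl then holds one prompt/completion pair, a format most fine-tuning toolchains accept directly.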
Optimization Tips:
- Depth setting: use 5 levels for conceptual documentation and 3 levels for API documentation
- Filter out ads and other noise with the selective_crawl.json configuration file
- Monitor resource usage with check_mcp_health.sh (a rough output completeness check is sketched below)
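check_mcp_health.sh covers resource usage; for output completeness, a quick cross-check between urls.txt and the crawl output can also help. The sketch below is not part of the DevDocs tooling and reuses the assumed url field from the conversion sketch above:

```python
import json
from pathlib import Path

# Rough completeness check: compare the seed URLs in urls.txt with the "url"
# fields found in the crawl_results JSON files. The "url" field name is an
# assumption about the output schema.
expected = {line.strip() for line in Path("urls.txt").read_text().splitlines() if line.strip()}

crawled = set()
for json_file in Path("crawl_results").glob("*.json"):
    pages = json.loads(json_file.read_text(encoding="utf-8"))
    if isinstance(pages, dict):
        pages = [pages]
    for page in pages:
        if "url" in page:
            crawled.add(page["url"])

# A seed counts as covered if at least one crawled page sits under it.
missing = {url for url in expected if not any(c.startswith(url) for c in crawled)}
print(f"{len(expected)} seed URLs, {len(crawled)} crawled pages, {len(missing)} seeds with no output")
for url in sorted(missing):
    print("missing:", url)
```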
Efficiency comparison: preparing 1,000 pages of training data manually takes about two weeks; with this workflow it can be done in about two hours, and the resulting data structure is more standardized.
This answer comes from the article "DevDocs: an MCP service for quickly crawling and organizing technical documentation".































