Background
Traditional crawlers fetch large numbers of low-quality web pages, which degrades LLM pretraining. Crawl4LLM filters content through a scoring mechanism.
Solution
- Dual scoring system: configure `rating_methods` in the config to filter with both `length` and `fasttext_score`
- Model selection: download the recommended openhermes classifier model (`bigram_200k_train.bin`) for the best evaluation results
- Sort order: set `order` to `desc` so that highly rated pages are prioritized for crawling
- Threshold adjustment: refine the filtering criteria further by adjusting the scoring weight parameters in the YAML file
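The settings above might look roughly like the following YAML fragment. This is an illustrative sketch only: the field names (`rating_methods`, `type`, `model_path`) are assumed from Crawl4LLM's config conventions, and the model path is a placeholder around the `bigram_200k_train.bin` file mentioned above.

```yaml
# Illustrative config fragment; field names and layout are assumptions,
# not copied from an official Crawl4LLM config.
order: desc                     # crawl highest-rated pages first
rating_methods:
  - type: length                # rater 1: favor longer documents
  - type: fasttext_score        # rater 2: openhermes fastText classifier
    model_path: fasttext_scorers/bigram_200k_train.bin
```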
Results
Tests have shown that this method can reduce the required crawling volume by 79% with no degradation in model training quality. For specialized domains, a custom fastText model can also be trained for scoring.
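To make the dual-scoring idea concrete, here is a minimal, self-contained Python sketch of how a length rater and a classifier rater could be combined to rank pages in descending order. The `Page` class, the capped length score, and the equal 0.5 weights are all illustrative assumptions, not Crawl4LLM's actual implementation; in the real tool the classifier score would come from the fastText model and the weights from the YAML config.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str
    classifier_score: float  # stand-in for a fastText quality score in [0, 1]

def length_score(page: Page) -> float:
    # Longer documents score higher, capped so huge pages don't dominate
    return min(len(page.text), 10_000) / 10_000

def combined_score(page: Page) -> float:
    # Equal weighting of the two raters; real weights would be configurable
    return 0.5 * length_score(page) + 0.5 * page.classifier_score

pages = [
    Page("a.example", "short text", 0.9),
    Page("b.example", "x" * 5_000, 0.8),
    Page("c.example", "y" * 20_000, 0.1),
]

# order: desc -> highest-rated pages are crawled first
ranked = sorted(pages, key=combined_score, reverse=True)
print([p.url for p in ranked])  # → ['b.example', 'c.example', 'a.example']
```

Note how the long but low-quality page (`c.example`) is outranked by the medium-length, high-quality one (`b.example`): combining the two raters filters pages that either signal alone would misjudge.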
This answer is based on the article "Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining".