Background
LLM pre-training requires large amounts of high-quality data, yet traditional web crawling suffers from redundancy and inefficiency. Crawl4LLM addresses this by algorithmically filtering for high-value content during the crawl.
Core Operating Procedures
- Configure intelligent filtering: set selection_method to dclm_fasttext_score in the YAML config to enable the pretraining-value scoring model (a DCLM-style fastText classifier) that ranks candidate documents by how useful they are for pretraining (a full example config is sketched after this list).
- Adjust the crawling parameters: use num_workers to control the number of worker threads (16 is a reasonable choice on a 16-core CPU) and max_num_docs to cap the total number of crawled documents.
- Use SSD storage: keep large datasets such as ClueWeb22 on an SSD to improve I/O performance.
- Enable W&B monitoring: set wandb: true to log the crawling process to Weights & Biases for later analysis and optimization.
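
Putting these settings together, a minimal config might look like the sketch below. The keys named in this answer (selection_method, num_workers, max_num_docs, wandb) are used as-is; the paths, the rating_methods block, and the classifier filename are illustrative assumptions and should be checked against the example configs shipped with the project.

```yaml
# Sketch of a Crawl4LLM YAML config; paths and the rating_methods block are assumptions.
cw22_root_path: /mnt/ssd/ClueWeb22          # keep the dataset on an SSD for I/O throughput
output_dir: crawl_results/dclm_fasttext_crawl
num_workers: 16                             # roughly one worker thread per CPU core
max_num_docs: 20000000                      # stop after this many documents
selection_method: dclm_fasttext_score       # rank candidates by the DCLM fastText score
wandb: true                                 # log crawl progress to Weights & Biases
wandb_project: crawl4llm                    # hypothetical project name
rating_methods:
  - type: fasttext_score
    rater_name: dclm_fasttext_score
    # Classifier must be downloaded to this path beforehand (see Caveats); filename is illustrative.
    model_path: fasttext_scorers/dclm_fasttext_classifier.bin
```

With a file like this saved under the project's config directory, the crawl is started through the project's crawl entry point with this file passed as the configuration, as described in its README.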
Caveats
Before the first run, download the fastText classifier to the directory specified in the config and make sure your Python version is ≥ 3.10. Running inside a virtual environment is recommended to avoid dependency conflicts.
This answer is based on the article "Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining".