Crawl4LLM is an open-source project jointly developed by Tsinghua University and Carnegie Mellon University that focuses on making web-page data acquisition more efficient for the pre-training phase of large language models (LLMs). Using an intelligent data-selection algorithm, the tool assesses how valuable each web page is for model training, so that high-quality content is screened effectively while invalid crawling is significantly reduced.
Its core strengths:
- Improved efficiency: experimental data show that the workload of crawling 100 web pages can be reduced to roughly 21 pages.
- Algorithmic innovation: uses the DCLM fastText classifier for content-quality assessment and supports a dual scoring mechanism based on document length and fasttext_score (see the sketch after this list).
- Engineering optimizations: a multi-threaded crawling engine and SSD-oriented storage layout allow it to handle billion-scale datasets such as ClueWeb22.
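
The following is a minimal sketch of how the dual scoring described above could be wired together in Python. The model path, label name, and combination weights are assumptions made for illustration and are not taken from the project's code.

```python
import fasttext  # pip install fasttext

# Assumed paths and labels for illustration only; substitute the actual
# DCLM fastText quality-classifier checkpoint and its positive label.
MODEL_PATH = "dclm_fasttext_quality.bin"
POSITIVE_LABEL = "__label__hq"

model = fasttext.load_model(MODEL_PATH)

def fasttext_score(text: str) -> float:
    """Probability that the page is high quality under the classifier."""
    # fastText's predict() rejects newlines, so flatten the text first.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return float(dict(zip(labels, probs)).get(POSITIVE_LABEL, 0.0))

def length_score(text: str, cap: int = 2000) -> float:
    """Length-based score, capped so very long pages don't dominate."""
    return min(len(text.split()), cap) / cap

def page_priority(text: str, w_quality: float = 0.8, w_length: float = 0.2) -> float:
    """Combine the two signals into a single crawl-priority score."""
    return w_quality * fasttext_score(text) + w_length * length_score(text)
```

In a priority-driven crawler, a score like `page_priority` would decide which discovered pages are fetched next, so crawl effort concentrates on pages most useful for pretraining.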
The project is open-sourced on GitHub, providing a complete code implementation and YAML-based configuration documentation; it meets the needs of academic research and is also suitable for industrial-scale application scenarios.
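
For a sense of what such YAML-driven configuration looks like, here is a purely illustrative example; the field names below are hypothetical and do not reflect the project's actual schema, so consult the repository's configuration docs for the real options.

```yaml
# Illustrative crawler configuration (hypothetical field names).
crawler:
  dataset: clueweb22          # corpus the crawler walks over
  num_workers: 16             # multi-threaded fetching
  output_dir: ./crawled_data
scoring:
  method: dclm_fasttext       # content-quality classifier
  combine_with: length        # dual scoring: quality + length
  top_k_per_iteration: 10000  # pages kept per crawl round
```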
This answer comes from the article "Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining".