Crawl4LLM is an open-source project jointly developed by Tsinghua University and Carnegie Mellon University that focuses on making web-page data acquisition more efficient for the pre-training phase of large language models (LLMs). Using an intelligent data-selection algorithm, the tool assesses how valuable each web page is for model training, so that high-quality content is screened effectively while invalid crawling is significantly reduced.
Its core strengths:
- Improved efficiency: experimental data show that the workload of crawling 100 web pages can be reduced to roughly 21 pages.
- Algorithmic innovation: uses the DCLM fastText classifier for content-quality assessment and supports a dual scoring mechanism based on document length and fasttext_score (see the sketch after this list).
- Engineering optimizations: a multi-threaded crawling engine and SSD-oriented storage layout allow it to handle billion-scale datasets such as ClueWeb22.
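
The following is a minimal sketch of how the dual scoring described above could be wired together in Python. The model path, label name, and combination weights are assumptions made for illustration and are not taken from the project's code.

```python
import fasttext  # pip install fasttext

# Assumed paths and labels for illustration only; substitute the actual
# DCLM fastText quality-classifier checkpoint and its positive label.
MODEL_PATH = "dclm_fasttext_quality.bin"
POSITIVE_LABEL = "__label__hq"

model = fasttext.load_model(MODEL_PATH)

def fasttext_score(text: str) -> float:
    """Probability that the page is high quality under the classifier."""
    # fastText's predict() rejects newlines, so flatten the text first.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return float(dict(zip(labels, probs)).get(POSITIVE_LABEL, 0.0))

def length_score(text: str, cap: int = 2000) -> float:
    """Length-based score, capped so very long pages don't dominate."""
    return min(len(text.split()), cap) / cap

def page_priority(text: str, w_quality: float = 0.8, w_length: float = 0.2) -> float:
    """Combine the two signals into a single crawl-priority score."""
    return w_quality * fasttext_score(text) + w_length * length_score(text)
```

In a priority-driven crawler, a score like `page_priority` would decide which discovered pages are fetched next, so crawl effort concentrates on pages most useful for pretraining.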
The project is open-sourced on GitHub, providing a complete code implementation and YAML-based configuration documentation; it meets the needs of academic research and is also suitable for industrial-scale application scenarios.
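
For a sense of what such YAML-driven configuration looks like, here is a purely illustrative example; the field names below are hypothetical and do not reflect the project's actual schema, so consult the repository's configuration docs for the real options.

```yaml
# Illustrative crawler configuration (hypothetical field names).
crawler:
  dataset: clueweb22          # corpus the crawler walks over
  num_workers: 16             # multi-threaded fetching
  output_dir: ./crawled_data
scoring:
  method: dclm_fasttext       # content-quality classifier
  combine_with: length        # dual scoring: quality + length
  top_k_per_iteration: 10000  # pages kept per crawl round
```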
This answer comes from the article "Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining".