
Crawl4LLM is an open-source tool that optimizes the efficiency of crawling web data for large-model pre-training

2025-09-05

Crawl4LLM is an open-source project jointly developed by Tsinghua University and Carnegie Mellon University that focuses on improving the efficiency of web-data acquisition for the pre-training phase of large language models (LLMs). Through an intelligent data-selection algorithm, the tool estimates each web page's value for model training, prioritizing high-quality content and sharply reducing wasted crawling.

The core strengths are reflected in:

  • Higher crawling efficiency: according to the project's experiments, crawling 21 pages can achieve what previously required crawling 100, cutting crawl volume by roughly 79%.
  • Algorithm innovation: content quality is assessed with the DCLM fastText classifier, and a dual scoring mechanism supports ranking pages by either document length or fasttext_score.
  • Engineering optimization: a multi-threaded crawling engine with SSD storage adaptation can handle billion-scale datasets such as ClueWeb22.
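The selection idea described above can be sketched in a few lines. This is an illustrative mock-up, not the project's actual API: the function name `select_pages`, the field names, and the scores are hypothetical stand-ins for the ranking the DCLM fastText classifier would provide.

```python
# Hypothetical sketch of Crawl4LLM-style page selection: candidate pages are
# ranked by one of the two scoring signals (document length or a classifier
# quality score, "fasttext_score") and only the top fraction is crawled.
# All names and values here are illustrative.

def select_pages(pages, key="fasttext_score", budget_ratio=0.21):
    """Rank candidate pages by the chosen score and keep the top fraction."""
    ranked = sorted(pages, key=lambda p: p[key], reverse=True)
    budget = max(1, int(len(ranked) * budget_ratio))
    return ranked[:budget]

candidates = [
    {"url": "a.example", "length": 1200, "fasttext_score": 0.91},
    {"url": "b.example", "length": 300,  "fasttext_score": 0.12},
    {"url": "c.example", "length": 800,  "fasttext_score": 0.75},
]

selected = select_pages(candidates, key="fasttext_score", budget_ratio=0.34)
print([p["url"] for p in selected])  # → ['a.example']
```

The key design point mirrored here is that the crawl budget is fixed in advance, so raising selection quality (a better classifier) directly reduces how many low-value pages are fetched.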

The project is open-sourced on GitHub with a complete code implementation and YAML configuration documentation, serving both academic research and industrial-scale application scenarios.
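As a rough illustration of the kind of run configuration the repository's YAML documentation covers, the settings might look like the following. Every field name below is a hypothetical stand-in, not the project's actual schema; consult the repository's YAML examples for the real keys.

```python
# Hypothetical crawl configuration, expressed as a Python dict for clarity.
# Field names are illustrative placeholders, not Crawl4LLM's actual schema.
crawl_config = {
    "dataset_root": "/ssd/clueweb22",  # ClueWeb22 kept on fast SSD storage
    "num_workers": 16,                 # multi-threaded crawling engine
    "selection": {
        "method": "fasttext_score",    # or "length", per the dual scoring mechanism
        "budget_ratio": 0.21,          # crawl only ~21% of candidate pages
    },
}

print(crawl_config["selection"]["method"])
```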
