As a professional tool for LLM pre-training, Crawl4LLM's engineering implementation is specifically optimized for large-scale data scenarios.
Key system features include:
- Scalable architecture: concurrency of 16 or more worker threads via the num_workers parameter (see the config sketch after this list); in testing on a 16-core CPU, crawling throughput improved roughly 15x.
- Storage optimization: datasets such as ClueWeb22 must be stored on SSDs to avoid the I/O bottlenecks of mechanical hard drives.
- Memory management: a built-in work-queue mechanism allows a single task to process on the order of 20 million documents.
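As a rough illustration of how these parameters fit together, here is a minimal config sketch in the YAML format used by the Crawl4LLM repository. The field names (cw22_root_path, seed_docs_file, num_workers, max_num_docs) follow the project's published example config, but should be verified against the current repo; the paths are placeholders.

```yaml
# Minimal Crawl4LLM config sketch (field names follow the repo's example
# config; verify against the current repository before use).
cw22_root_path: /mnt/ssd/ClueWeb22_A   # keep ClueWeb22 on an SSD to avoid I/O bottlenecks
seed_docs_file: seed.txt               # initial seed documents for the crawl
output_dir: crawl_results/demo_crawl   # reserve hundreds of GB of space here
num_workers: 16                        # match your CPU core count; 16+ threads scale well
max_num_docs: 20000000                 # a single task can handle ~20 million documents
```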
The development team offers the following usage recommendations:
- For academic research, setting num_selected_docs_per_iter to 10,000 is ideal.
- For industrial-grade applications, enable wandb log monitoring to track crawling progress and resource consumption in real time (see the sketch after this list).
- Reserve hundreds of gigabytes of space in the output directory to store raw HTML and converted plain text.
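Putting these recommendations together, a research-scale run might look like the following sketch. The selection_method value and the crawl.py entry point are taken from the project's README and are assumptions worth double-checking against the current repository.

```yaml
# Research-scale config sketch with wandb monitoring enabled.
num_selected_docs_per_iter: 10000      # recommended setting for academic research
selection_method: dclm_fasttext_score  # rank candidate docs by a pretraining-quality score
order: desc                            # crawl highest-scoring documents first
wandb: true                            # track crawl progress and resource usage in real time
wandb_project: crawler
wandb_run_name: seed_10k_crawl_demo
# Launch the crawler with this config (per the repo's README):
#   python crawl.py crawl --config config.yaml
```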
These design choices allow the tool to adapt to scenarios ranging from lab experiments to production environments.
This answer is based on the article "Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining".