How to use Crawl4LLM for web crawling and data extraction?

2025-09-05

Detailed workflow

Using Crawl4LLM involves three key steps:

Configure a crawl task:
- Create a YAML configuration file in the configs directory
- Set key parameters such as the dataset path, number of threads, and maximum number of documents
- dclm_fasttext_score is the recommended choice for selection_method
Run the crawler: execute python crawl.py crawl --config configs/my_config.yaml
Fetch the data:
- Use fetch_docs.py to convert document IDs to text
- Call access_data.py to check the content of a specific document
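
The first two steps can be sketched in Python as follows. This is a minimal illustration, not the project's own tooling: the helper names are hypothetical, and the config keys cw22_root_path, max_num_docs and wandb are assumptions that should be checked against the sample configs shipped with Crawl4LLM. Only num_workers, selection_method and the crawl command itself come from the text above. The dataset path is a placeholder.

```python
from pathlib import Path


def render_config(dataset_path: str, num_workers: int, max_num_docs: int,
                  use_wandb: bool = True) -> str:
    """Return the text of a minimal crawl config in YAML syntax."""
    lines = [
        # Assumed key name for the dataset path; verify against the sample configs.
        f"cw22_root_path: {dataset_path}",
        f"num_workers: {num_workers}",
        # Assumed key name for the maximum number of documents.
        f"max_num_docs: {max_num_docs}",
        "selection_method: dclm_fasttext_score",
        # Assumed key name for toggling wandb logging.
        f"wandb: {'true' if use_wandb else 'false'}",
    ]
    return "\n".join(lines) + "\n"


def write_config(path: Path, text: str) -> Path:
    """Write the config text to disk, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def crawl_command(config_path: Path) -> list[str]:
    """Build the crawl command quoted in the text as an argument list."""
    return ["python", "crawl.py", "crawl", "--config", config_path.as_posix()]


if __name__ == "__main__":
    # Print the config and the command without touching the file system.
    text = render_config("/path/to/dataset", num_workers=16, max_num_docs=1000)
    print(text, end="")
    print(" ".join(crawl_command(Path("configs/my_config.yaml"))))
```

To save the file, call write_config(Path("configs/my_config.yaml"), text) and then pass the list from crawl_command to subprocess.run from the repository root.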

Enable wandb logging to make the crawling process easier to analyze
Recommended setting for a 16-core CPU: num_workers: 16
Reserve hundreds of gigabytes of disk space when processing billions of documents
SSD storage can significantly speed up processing of large-scale data sets
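
A quick pre-flight check along these lines can be sketched in Python. It assumes one worker per CPU core, which matches the 16-core example above, and uses 300 GB as an illustrative stand-in for "hundreds of gigabytes"; both the threshold and the helper names are assumptions, not values prescribed by Crawl4LLM.

```python
import os
import shutil
from typing import Optional


def suggest_num_workers(cpu_count: Optional[int] = None) -> int:
    """Return one worker per CPU core, falling back to 1 if the count is unknown."""
    if cpu_count is None:
        cpu_count = os.cpu_count()
    return max(1, cpu_count or 1)


def free_disk_gb(path: str = ".") -> float:
    """Return the free space on the volume holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path: str = ".", required_gb: float = 300.0) -> bool:
    """Check free space against a threshold (300 GB is an illustrative assumption)."""
    return free_disk_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"num_workers: {suggest_num_workers()}")
    print(f"free disk: {free_disk_gb('.'):.1f} GB")
    print(f"enough disk for a large crawl: {has_enough_disk('.')}")
```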