Operation process details
Using Crawl4LLM involves three key steps:
- Configuring a crawl task:
  - Create a YAML configuration file in the `configs` directory (see the example config after this list)
  - Set key parameters such as the dataset path, the number of worker threads, and the maximum number of documents
  - Choosing `dclm_fasttext_score` as the `selection_method` is recommended
- Running the crawler: execute `python crawl.py crawl --config configs/my_config.yaml`
- Fetching the data (a command sketch follows this list):
  - Use `fetch_docs.py` to convert the crawled document IDs into text
  - Run `access_data.py` to inspect the content of a specific document
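A minimal configuration sketch for the first step is shown below. It assumes a layout similar to the example configs shipped with the repository; the exact field names (e.g. `cw22_root_path`, `rating_methods`) and the fastText model path are assumptions and should be checked against the files in the `configs` directory.

```yaml
# configs/my_config.yaml -- illustrative sketch; verify field names against
# the example configs in the repository before running.
cw22_root_path: /path/to/ClueWeb22       # root of the ClueWeb22 dataset
seed_docs_file: seed.txt                 # initial seed document IDs
output_dir: crawl_results/my_crawl       # where crawled document IDs are written
num_selected_docs_per_iter: 10000        # documents selected per crawl iteration
num_workers: 16                          # match the number of CPU cores
max_num_docs: 20000000                   # stop after this many documents
selection_method: dclm_fasttext_score    # recommended selection method
order: desc                              # crawl highest-scoring documents first
wandb: true                              # enable wandb logging (see tips below)
rating_methods:
  - type: fasttext_score
    rater_name: dclm_fasttext_score
    model_path: fasttext_scorers/dclm_fasttext.bin   # hypothetical path to the DCLM fastText classifier
```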
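The crawl, fetch, and inspection steps can then be run roughly as follows. The `crawl.py` invocation is the one given above; the flags for `fetch_docs.py` and the argument to `access_data.py` are assumptions and should be confirmed with each script's `--help` output.

```bash
# 1. Run the crawler with the config created above
python crawl.py crawl --config configs/my_config.yaml

# 2. Convert the crawled document IDs into text
#    (--input_dir/--output_dir are assumed flag names; check `python fetch_docs.py --help`)
python fetch_docs.py --input_dir crawl_results/my_crawl --output_dir crawl_results/my_crawl_texts

# 3. Inspect the content of a specific document
#    (the ClueWeb22 path argument is an assumption; check `python access_data.py --help`)
python access_data.py /path/to/ClueWeb22
```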
Practical tips:
- Enable wandb logging for easy analysis of the crawling process
- For a 16-core CPU, a setting of `num_workers: 16` is recommended (see the config excerpt after these tips)
- Reserve several hundred gigabytes of disk space when processing billions of documents
- SSD storage can significantly speed up processing of large-scale data sets
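As a concrete illustration of the first two tips, the relevant lines of the YAML config would look roughly like this; the field names follow the sketch above, and `wandb_project` in particular is an assumed field that should be verified against the repository's example configs.

```yaml
# Illustrative excerpt only; see the full config sketch above
wandb: true                 # log crawl statistics to Weights & Biases
wandb_project: crawl4llm    # assumed project-name field
num_workers: 16             # one worker per CPU core on a 16-core machine
```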
This answer is based on the article "Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining".