Environment preparation steps
The following system configuration is required to install Crawl4LLM:
- Python Requirements: Ensure that Python 3.10 or higher is installed
- Virtual environment creation:
  - Linux/Mac:
    python -m venv crawl4llm_env && source crawl4llm_env/bin/activate
  - Windows:
    python -m venv crawl4llm_env && crawl4llm_env\Scripts\activate
- Source code acquisition:
  git clone https://github.com/cxcscmu/Crawl4LLM.git
- Dependency installation: Go to the project directory and execute
  pip install -r requirements.txt
- Classifier download: Place the DCLM fastText classifier model file into the fasttext_scorers/ directory
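The checks implied by the steps above can be sketched as a small preflight script. This is a minimal sketch, not part of Crawl4LLM: the function names `python_ok` and `find_classifier_files` are hypothetical, and the `.bin` suffix is an assumption about how the fastText model file is named.

```python
import sys
from pathlib import Path

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements above


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) Python version meets the minimum."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= minimum


def find_classifier_files(project_dir, suffix=".bin"):
    """List candidate model files in <project_dir>/fasttext_scorers.

    Assumption: the classifier is a fastText binary whose name ends in ".bin".
    """
    scorer_dir = Path(project_dir) / "fasttext_scorers"
    if not scorer_dir.is_dir():
        return []
    return sorted(
        p for p in scorer_dir.iterdir() if p.is_file() and p.suffix == suffix
    )


if __name__ == "__main__":
    # Optional first argument: path to the cloned project directory.
    project = sys.argv[1] if len(sys.argv) > 1 else "."
    print("Python version OK:", python_ok())
    files = find_classifier_files(project)
    print("Classifier files found:", [f.name for f in files] or "none")
```

Run it from the cloned project directory, or pass the directory as the first argument, before starting a crawl.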
Special notes
- Access to ClueWeb22 datasets needs to be requested in advance
- It is recommended to store large-scale datasets on SSDs to improve IO performance
- Ensure that the network connection is stable and unrestricted so that all dependency packages can be downloaded
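To confirm that every dependency was actually downloaded and installed, one option is to compare requirements.txt against the installed packages. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the parser is deliberately simplified: it skips option lines and URL-based requirements rather than handling the full requirements-file syntax.

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract bare package names from requirements-style lines (simplified)."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        # Skip blanks, pip options (-r, -e, ...), and URL requirements.
        if not line or line.startswith("-") or "://" in line:
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that have no installed distribution."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    with open("requirements.txt", encoding="utf-8") as fh:
        absent = missing_packages(requirement_names(fh))
    print("Missing packages:", absent or "none")
```

If any packages are reported missing, re-running pip install -r requirements.txt on a working network connection should resolve them.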
This answer comes from the article "Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining".































