Background
Highly concurrent crawling is prone to running out of memory, which can lead to the crawler process being killed (OOM).
Solutions
- Gradual ramp-up: start with num_workers=4 and increase gradually up to what the system can tolerate.
- Memory monitoring: enable wandb to track memory usage during the run (see the sketch after this list).
- Batch control: decrease the num_selected_docs_per_iter value (2000-5000 recommended).
- Resource isolation: limit the container's memory usage with Docker.
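
A minimal sketch of the memory-monitoring idea, assuming psutil and wandb are installed and you are logged in to wandb; the project name and logging interval below are illustrative, not part of Crawl4LLM itself:

```python
import time

import psutil
import wandb


def monitor_memory(interval_s: float = 30.0) -> None:
    """Periodically log the crawler process's resident memory to wandb."""
    wandb.init(project="crawl4llm-memory")  # project name is illustrative
    proc = psutil.Process()  # the current (crawler) process
    while True:
        rss_gb = proc.memory_info().rss / 1024 ** 3
        wandb.log({"memory/rss_gb": rss_gb})
        time.sleep(interval_s)
```

Run this in a background thread (or as a separate script pointed at the crawler's PID) so you can see when resident memory approaches the machine's limit and scale num_workers down before the OOM killer intervenes.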
Optimization Recommendations
- On machines with 64 GB of RAM, keep the worker count at or below 32.
- When memory overflows, first check whether the fastText model has been loaded into memory.
- Try lowering the chunksize parameter in crawl.py to reduce how much is processed in a single pass (see the sketch below).
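
The chunksize change is about processing fewer documents at once so that only one small batch is resident in memory at a time. A generic sketch of that idea follows; the function and variable names here are hypothetical and not the actual Crawl4LLM code:

```python
from typing import Iterable, Iterator, List


def iter_chunks(docs: Iterable[str], chunksize: int = 500) -> Iterator[List[str]]:
    """Yield documents in small batches so only one batch is held in memory."""
    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= chunksize:
            yield batch
            batch = []
    if batch:
        yield batch


# Usage: a smaller chunksize lowers peak memory at the cost of more iterations.
# for batch in iter_chunks(fetched_docs, chunksize=500):
#     score_with_fasttext(batch)  # hypothetical scoring step
```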
This answer is based on the article "Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining".