Background
Highly concurrent crawling is prone to running out of memory, which can lead to the crawler process being killed (OOM).
Solutions
- Gradual ramp-up: start with num_workers=4 and increase gradually up to what the system can tolerate.
- Memory monitoring: enable wandb to track memory usage during the run (see the sketch after this list).
- Batch control: decrease the num_selected_docs_per_iter value (2000-5000 recommended).
- Resource isolation: limit the container's memory usage with Docker.
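
A minimal sketch of the memory-monitoring idea, assuming psutil and wandb are installed and you are logged in to wandb; the project name and logging interval below are illustrative, not part of Crawl4LLM itself:

```python
import time

import psutil
import wandb


def monitor_memory(interval_s: float = 30.0) -> None:
    """Periodically log the crawler process's resident memory to wandb."""
    wandb.init(project="crawl4llm-memory")  # project name is illustrative
    proc = psutil.Process()  # the current (crawler) process
    while True:
        rss_gb = proc.memory_info().rss / 1024 ** 3
        wandb.log({"memory/rss_gb": rss_gb})
        time.sleep(interval_s)
```

Run this in a background thread (or as a separate script pointed at the crawler's PID) so you can see when resident memory approaches the machine's limit and scale num_workers down before the OOM killer intervenes.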
Optimization Recommendations
- On machines with 64 GB of RAM, keep the worker count at or below 32.
- When memory overflows, first check whether the fastText model has been loaded into memory.
- Try lowering the chunksize parameter in crawl.py to reduce how much is processed in a single pass (see the sketch below).
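
The chunksize change is about processing fewer documents at once so that only one small batch is resident in memory at a time. A generic sketch of that idea follows; the function and variable names here are hypothetical and not the actual Crawl4LLM code:

```python
from typing import Iterable, Iterator, List


def iter_chunks(docs: Iterable[str], chunksize: int = 500) -> Iterator[List[str]]:
    """Yield documents in small batches so only one batch is held in memory."""
    batch: List[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= chunksize:
            yield batch
            batch = []
    if batch:
        yield batch


# Usage: a smaller chunksize lowers peak memory at the cost of more iterations.
# for batch in iter_chunks(fetched_docs, chunksize=500):
#     score_with_fasttext(batch)  # hypothetical scoring step
```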
This answer is based on the article "Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining".