Key Issue
Raw crawled data must be processed before it can meet model-training requirements.
Optimization Methods
- Text extraction optimization: add the -clean_html parameter when running fetch_docs.py to strip page tags.
- Content segmentation: configure max_length in the YAML config to avoid overly long paragraphs.
- Multilingual support: evaluate non-English content with the multilingual fastText model.
- Sampling strategy: alternate between dclm_fasttext_score ranking and random selection to improve data diversity (see the sketch after this list).
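
The alternating strategy can be sketched roughly as follows. This is a minimal illustration, not Crawl4LLM's actual scheduler: the select_batch function, the frontier list, and the (doc_id, score) tuple layout are all assumptions made for the example.

```python
import random

def select_batch(candidates, batch_size, use_score_pass):
    """Pick the next crawl batch from (doc_id, dclm_fasttext_score) pairs.
    A score pass favors high-quality pages; a random pass keeps the
    candidate pool diverse."""
    if use_score_pass:
        # Score pass: highest dclm_fasttext_score first.
        return sorted(candidates, key=lambda c: c[1], reverse=True)[:batch_size]
    # Random pass: uniform sample for diversity.
    return random.sample(candidates, min(batch_size, len(candidates)))

# Toy frontier of 1000 scored documents; alternate passes per iteration.
frontier = [(f"doc_{i}", random.random()) for i in range(1000)]
for iteration in range(4):
    mode = "score" if iteration % 2 == 0 else "random"
    batch = select_batch(frontier, batch_size=100,
                         use_score_pass=(iteration % 2 == 0))
    print(f"iteration {iteration}: {mode} pass, {len(batch)} docs")
```

Alternating the two passes trades a little average quality for coverage: score-only selection tends to collapse onto a narrow slice of the web, while the interleaved random passes keep lower-scored but novel domains in the mix.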
Effectiveness Verification
Sample documents via access_data.py and review them for topic relevance, text coherence, information density, and similar metrics. High-quality data should satisfy both of the following conditions simultaneously (see the filter sketch after the list):
1) fasttext_score ≥ 0.8
2) length ∈ [500, 2000] characters
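
Both thresholds can be applied together with a small filter like the one below. This is a hedged sketch: the doc dict and its fasttext_score/text keys are assumptions about the sampled record layout, not access_data.py's actual output format.

```python
def passes_quality_checks(doc):
    """Apply both thresholds at once: fastText score and length window."""
    return (doc["fasttext_score"] >= 0.8
            and 500 <= len(doc["text"]) <= 2000)

# Hypothetical sampled document; field names are illustrative.
sample = {"fasttext_score": 0.85, "text": "x" * 800}
print(passes_quality_checks(sample))  # True
```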
This answer comes from the article *Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining*.