Key Issue
Raw crawled data must be processed before it can meet model-training requirements.
Optimization Methods
- Text extraction optimization: add the -clean_html parameter when running fetch_docs.py to strip page tags.
- Content segmentation: configure max_length in the YAML config to avoid overly long paragraphs.
- Multilingual support: evaluate non-English content with the multilingual fastText model.
- Sampling strategy: alternate between dclm_fasttext_score ranking and random selection to improve data diversity (see the sketch after this list).
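
The alternating strategy can be sketched roughly as follows. This is a minimal illustration, not Crawl4LLM's actual scheduler: the select_batch function, the frontier list, and the (doc_id, score) tuple layout are all assumptions made for the example.

```python
import random

def select_batch(candidates, batch_size, use_score_pass):
    """Pick the next crawl batch from (doc_id, dclm_fasttext_score) pairs.
    A score pass favors high-quality pages; a random pass keeps the
    candidate pool diverse."""
    if use_score_pass:
        # Score pass: highest dclm_fasttext_score first.
        return sorted(candidates, key=lambda c: c[1], reverse=True)[:batch_size]
    # Random pass: uniform sample for diversity.
    return random.sample(candidates, min(batch_size, len(candidates)))

# Toy frontier of 1000 scored documents; alternate passes per iteration.
frontier = [(f"doc_{i}", random.random()) for i in range(1000)]
for iteration in range(4):
    mode = "score" if iteration % 2 == 0 else "random"
    batch = select_batch(frontier, batch_size=100,
                         use_score_pass=(iteration % 2 == 0))
    print(f"iteration {iteration}: {mode} pass, {len(batch)} docs")
```

Alternating the two passes trades a little average quality for coverage: score-only selection tends to collapse onto a narrow slice of the web, while the interleaved random passes keep lower-scored but novel domains in the mix.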
Effectiveness Verification
Sample documents via access_data.py and review them for topic relevance, text coherence, information density, and similar metrics. High-quality data should satisfy both of the following conditions simultaneously (see the filter sketch after the list):
1) fasttext_score ≥ 0.8
2) length ∈ [500, 2000] characters
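
Both thresholds can be applied together with a small filter like the one below. This is a hedged sketch: the doc dict and its fasttext_score/text keys are assumptions about the sampled record layout, not access_data.py's actual output format.

```python
def passes_quality_checks(doc):
    """Apply both thresholds at once: fastText score and length window."""
    return (doc["fasttext_score"] >= 0.8
            and 500 <= len(doc["text"]) <= 2000)

# Hypothetical sampled document; field names are illustrative.
sample = {"fasttext_score": 0.85, "text": "x" * 800}
print(passes_quality_checks(sample))  # True
```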
This answer comes from the article *Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining*.