
How to improve the applicability of crawling results in LLM pre-training?

2025-09-05 1.6 K

Key Issue

Raw crawled data must be cleaned and filtered before it meets the requirements of model training.

Optimization methods

  • Text extraction optimization: Add the -clean_html parameter when running fetch_docs.py to strip page tags
  • Content segmentation: Configure max_length in the YAML file to avoid overly long paragraphs
  • Multi-language support: Evaluate non-English content with a multilingual fastText model
  • Sampling strategy: Alternate between the dclm_fasttext_score and random modes to keep the data diverse
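
The sampling strategy in the last bullet can be illustrated with a minimal sketch. It is one possible reading of "alternating", not the tool's own implementation: it assumes documents are plain dictionaries carrying a dclm_fasttext_score field, and the function name alternate_sample is hypothetical. Even-numbered picks take the best-scored remaining document, odd-numbered picks take a uniformly random one.

```python
import random


def alternate_sample(docs, k, seed=0):
    """Pick up to k docs, alternating score-ranked and random picks.

    Assumption: each doc is a dict with a "dclm_fasttext_score" key.
    This is an illustrative interpretation, not the crawler's own code.
    """
    rng = random.Random(seed)
    # Sort descending so the best remaining doc is always at index 0.
    remaining = sorted(
        docs, key=lambda d: d["dclm_fasttext_score"], reverse=True
    )
    picked = []
    while remaining and len(picked) < k:
        if len(picked) % 2 == 0:
            # Score-driven pick: take the highest-scoring remaining doc.
            picked.append(remaining.pop(0))
        else:
            # Diversity pick: take any remaining doc uniformly at random.
            picked.append(remaining.pop(rng.randrange(len(remaining))))
    return picked
```

In practice the same effect may also be reached by running the crawler with each selection mode in turn and merging the outputs; the sketch only shows the mixing idea on a list that is already in memory.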

Effectiveness Verification

Spot-check document quality by sampling with access_data.py. Suggested checks include topic relevance, text coherence, and information density. High-quality data should satisfy both of the following conditions:
1) fasttext_score ≥ 0.8
2) length ∈ [500,2000] characters
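
As a rough sketch, the two conditions above can be combined into a single filter. The field names text and fasttext_score and the function name is_quality_doc are assumptions made for illustration; the actual output format of access_data.py may differ.

```python
def is_quality_doc(doc, min_score=0.8, min_len=500, max_len=2000):
    """Return True if doc meets both the score and the length condition.

    Assumption: doc is a dict with "text" (str) and "fasttext_score" (float).
    Missing fields are treated as failing the check.
    """
    text = doc.get("text", "")
    score = doc.get("fasttext_score", 0.0)
    # Length is measured in characters, matching the [500, 2000] range.
    return score >= min_score and min_len <= len(text) <= max_len


def filter_quality(docs):
    """Keep only the docs that pass is_quality_doc."""
    return [d for d in docs if is_quality_doc(d)]
```

Both bounds are inclusive, so a 500-character document with a score of exactly 0.8 passes.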
