Efficient solution for extracting web page data
To extract LLM-ready training data from complex web pages, WaterCrawl provides a complete toolchain and workflow:
- Predefined crawling rules: filter out irrelevant content (e.g. script/style elements) by configuring exclude_tags in the pageOptions parameter, and capture target tags (h1, p, etc.) precisely with include_tags.
- Intelligent main-content extraction: enable the only_main_content=true parameter to automatically detect and retain the page's main content, removing distracting elements such as headers and footers.
- Multi-format output support: results can be exported directly as LLM-friendly JSON or Markdown, preserving the document's structure.
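The options above can be sketched as a request payload. The parameter names (pageOptions, exclude_tags, include_tags, only_main_content) come from the description; the surrounding field names and overall schema are assumptions and should be checked against WaterCrawl's own API documentation.

```python
# Sketch of a crawl request payload using the options discussed.
# Exact schema and field names are assumptions; consult WaterCrawl's docs.
payload = {
    "url": "https://example.com/article",       # hypothetical target page
    "pageOptions": {
        "exclude_tags": ["script", "style"],    # filter out irrelevant content
        "include_tags": ["h1", "p"],            # capture target tags precisely
        "only_main_content": True,              # drop headers, footers, etc.
    },
    "output_format": "markdown",                # assumed: JSON or Markdown output
}
```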
Practical steps:
- Submit a JSON request containing the target URL and extraction rules via the API
- The system automatically performs the crawling task and cleans the content
- Download the processed, structured data file
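The first step (submitting the JSON request) might look like the following sketch. The endpoint URL and authorization scheme are hypothetical placeholders, not WaterCrawl's actual API; only the payload structure mirrors the parameters described above.

```python
import json
import urllib.request

# Sketch of submitting a crawl job over HTTP. The endpoint path and
# auth header are hypothetical; substitute the real API details.
def build_crawl_request(url: str, page_options: dict, api_key: str) -> urllib.request.Request:
    body = json.dumps({"url": url, "pageOptions": page_options}).encode("utf-8")
    return urllib.request.Request(
        "https://api.example.com/v1/crawl",        # hypothetical endpoint
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )

req = build_crawl_request(
    "https://example.com/article",
    {"only_main_content": True, "exclude_tags": ["script", "style"]},
    "YOUR_API_KEY",
)
# The request could then be sent with urllib.request.urlopen(req); the
# crawled, cleaned result would be fetched per the remaining steps.
```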
This answer is based on the article "WaterCrawl: transforming web content into data usable for large models".