WaterCrawl provides three standard output formats (JSON, Markdown, and CSV), rendered through a format conversion engine. JSON preserves the original DOM hierarchy and metadata intact, making it suitable for direct consumption by machine-learning pipelines; Markdown is optimized for readability and well suited to knowledge-base construction; and CSV imports easily into Excel for business analysis.
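As a rough illustration of the three targets, the sketch below renders one parsed record into each format. The record fields and helper names here are assumptions for the example, not WaterCrawl's actual schema.

```python
import csv
import io
import json

# A hypothetical parsed record; WaterCrawl's real field names may differ.
record = {
    "url": "https://example.com/post/1",
    "title": "Sample Post",
    "body": "First paragraph of the article.",
}

def to_json(rec: dict) -> str:
    # JSON keeps the full structure and metadata intact.
    return json.dumps(rec, ensure_ascii=False, indent=2)

def to_markdown(rec: dict) -> str:
    # Markdown trades structure for human readability.
    return f"# {rec['title']}\n\n{rec['body']}\n\nSource: {rec['url']}\n"

def to_csv(rows: list) -> str:
    # CSV flattens records into rows for spreadsheet import.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "title", "body"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The same record can thus feed a machine-learning pipeline, a knowledge base, and a spreadsheet without re-crawling.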
Under the hood, the conversion builds on Scrapy's Item Pipeline architecture, with a format renderer transforming each scraped item on the fly. In a news-aggregation project, developers can generate JSON and Markdown outputs at the same time: the former feeds a recommender system that analyzes keyword co-occurrence, while the latter goes to the CMS for content publishing. Tests show an average of 120 ms to convert 1 MB of web data, roughly three times faster than traditional solutions.
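A Scrapy item pipeline that emits two formats side by side could look like the sketch below. The class name, file paths, and item fields are assumptions for illustration; only the `open_spider`/`process_item`/`close_spider` hooks are Scrapy's actual pipeline interface.

```python
import json

class DualFormatPipeline:
    """Sketch of a Scrapy item pipeline writing JSON Lines and Markdown
    in parallel; paths and field names here are assumptions."""

    def open_spider(self, spider):
        self.jsonl = open("items.jsonl", "w", encoding="utf-8")
        self.md = open("items.md", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.jsonl.close()
        self.md.close()

    def process_item(self, item, spider):
        row = dict(item)
        # JSON Lines: one structured record per line for downstream analysis.
        self.jsonl.write(json.dumps(row, ensure_ascii=False) + "\n")
        # Markdown: human-readable rendering for CMS publishing.
        self.md.write(f"## {row.get('title', '')}\n\n{row.get('body', '')}\n\n")
        return item
```

Registering such a class under `ITEM_PIPELINES` in the Scrapy settings would make every crawled item flow through both writers in a single pass.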
The system can also store conversion result files directly in MinIO and generate pre-signed download links. One medical research organization uses this feature to automatically convert crawled clinical guidelines into standard Markdown and synchronize them to GitBook, maintaining an always-current industry knowledge center.
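The upload-and-share step could be sketched with the official MinIO Python SDK as below. The endpoint, credentials, bucket layout, and the `results/` key prefix are all placeholders, not WaterCrawl's actual configuration.

```python
from datetime import timedelta

def object_key(job_id: str, fmt: str) -> str:
    # Deterministic object naming; the results/ prefix is an assumption.
    return f"results/{job_id}/output.{fmt}"

def upload_and_share(bucket: str, job_id: str, local_path: str, fmt: str = "md") -> str:
    """Upload a conversion result to MinIO and return a pre-signed download link.

    Requires `pip install minio` and a reachable MinIO server; the endpoint
    and credentials below are placeholders.
    """
    from minio import Minio  # imported lazily so the sketch loads without the SDK

    client = Minio("minio.example.com:9000",
                   access_key="ACCESS_KEY", secret_key="SECRET_KEY", secure=True)
    key = object_key(job_id, fmt)
    client.fput_object(bucket, key, local_path)
    # Pre-signed GET URL valid for one hour.
    return client.presigned_get_object(bucket, key, expires=timedelta(hours=1))
```

The pre-signed link lets downstream consumers (such as a GitBook sync job) fetch the file over plain HTTPS without holding MinIO credentials.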
This answer is drawn from the article "WaterCrawl: transforming web content into data usable for large models".