WaterCrawl offers several data output options to suit different processing scenarios:
- JSON format: Highly structured, so programs can easily parse and process the results downstream
- Markdown format: Preserves basic text structure and formatting, which suits document processing
- MinIO storage: Supports efficient storage and management of files at large scale
- Direct API output: Crawl results can be retrieved in real time through a RESTful interface
These formats are designed with two needs in mind: the standardized input that large language models require, and easy integration for developers. Users can specify the desired output format either in a configuration file or as an API request parameter.
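As a minimal sketch of choosing the format through a request parameter, the Python snippet below builds a request body with a validated format value. The function `build_crawl_request`, the field names `url` and `output_format`, and the format identifiers are all hypothetical and used for illustration only; they are not taken from WaterCrawl's documented API, so check the official reference for the real schema.

```python
import json

# Hypothetical format identifiers based on the formats named in the list above;
# the values WaterCrawl actually accepts may differ.
SUPPORTED_FORMATS = {"json", "markdown"}


def build_crawl_request(url: str, output_format: str = "json") -> dict:
    """Build a request body that asks for a specific output format."""
    fmt = output_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    # "url" and "output_format" are assumed field names, not a documented schema.
    return {"url": url, "output_format": fmt}


payload = build_crawl_request("https://example.com", "Markdown")
print(json.dumps(payload, indent=2))
```

Validating the format before sending the request surfaces a typo locally instead of as a server-side error. The same dictionary could equally be loaded from a configuration file.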
This answer comes from the article "WaterCrawl: transforming web content into data usable for large models".