
WaterCrawl is a powerful open-source web crawler designed to help users extract data from web pages and transform it into formats suitable for Large Language Model (LLM) processing. It is built in Python on top of Django, Scrapy, and Celery, which together support efficient web crawling and data processing. WaterCrawl provides SDKs in multiple languages, including Node.js, Go, and PHP, so developers can easily integrate it into different projects. It can be deployed quickly via Docker or customized through the provided APIs. It is designed to simplify web data extraction for developers and organizations that work with large amounts of web content.

Function List

  • Efficient Web Crawling: Supports custom crawl depth, speed, and target content so web data can be fetched quickly.
  • Data Extraction and Cleaning: Automatically filters out irrelevant tags (e.g. scripts, styles), extracts the main content, and supports multiple output formats (e.g. JSON, Markdown).
  • Multi-language SDK Support: Provides SDKs for Node.js, Go, PHP, and Python to meet different development needs.
  • Real-time Progress Monitoring: Provides real-time status updates for crawl tasks via Celery.
  • Docker Deployment Support: Quickly builds local or production environments with Docker Compose.
  • MinIO Integration: Supports file storage and download, suitable for handling large-scale data.
  • Plugin Extensions: Provides a plugin framework that allows developers to customize crawling and processing logic.
  • API Integration: Supports managing crawl tasks, getting results, or downloading data via the API.

Using Help

WaterCrawl's installation and use are aimed at developers and technical teams, and are best suited to users familiar with Python and Docker. The detailed installation and usage process is described below.

Installation process

  1. Clone the repository
    First, clone WaterCrawl's GitHub repository to your local machine:

    git clone https://github.com/watercrawl/watercrawl.git
    cd watercrawl
    

This will download the WaterCrawl source code to your local environment.

  2. Installing Docker
    Make sure Docker and Docker Compose are installed on your system. If not, download and install them from the Docker official website.
  3. Configuring Environment Variables
    Go into the docker directory and copy the sample environment configuration file:

    cd docker
    cp .env.example .env
    

    Edit the .env file to configure the necessary parameters, such as the database connection and the MinIO storage address. If deploying in a non-local environment, such as a cloud server, the MinIO configuration must be updated:

    MINIO_EXTERNAL_ENDPOINT=your-domain.com
    MINIO_BROWSER_REDIRECT_URL=http://your-domain.com/minio-console/
    MINIO_SERVER_URL=http://your-domain.com/
    

    These settings ensure that the file upload and download functions work properly. More configuration details can be found in the DEPLOYMENT.md documentation.

  4. Starting the Docker Container
    Start the service using Docker Compose:

    docker compose up -d
    

    This will start WaterCrawl's core services, including the Django backend, Scrapy crawler, and MinIO storage.

  5. Accessing the Application
    Open your browser and visit http://localhost to check that the service is running correctly. If deployed on a remote server, replace localhost with your domain name or IP address. A small script-based reachability check is sketched below.
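
If you prefer to check from a script rather than a browser, the following is a minimal reachability check. This is a convenience sketch using the Python requests library, not part of WaterCrawl itself; replace the URL with your domain when deploying remotely.

import requests

# A 200 response means the web interface is being served.
resp = requests.get("http://localhost", timeout=10)
print(resp.status_code)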

Functional operation flow

WaterCrawl's core functions are web crawling and data extraction. The detailed steps for the main functions are described below.

Web Crawling

WaterCrawl uses the Scrapy framework for web crawling, and users can start crawl tasks via the API or the command line. For example, a crawl task can be initiated with a request like the following:

{
  "url": "https://example.com",
  "pageOptions": {
    "exclude_tags": ["script", "style"],
    "include_tags": ["p", "h1", "h2"],
    "wait_time": 1000,
    "only_main_content": true,
    "include_links": true,
    "timeout": 15000
  },
  "sync": true,
  "download": true
}
  • Parameter description:
    • exclude_tags: Filters out the specified HTML tags (such as scripts and styles).
    • include_tags: Extracts only the contents of the specified tags.
    • only_main_content: Extracts only the main content of the page, ignoring sidebars, ads, etc.
    • include_links: Whether to include the links found on the page.
    • timeout: Sets the crawl timeout in milliseconds.
  • Procedure:
    1. Send the above JSON request using an SDK (such as Python or Node.js) or directly through the API, as in the sketch after this list.
    2. WaterCrawl returns the crawl results with the extracted text, links, or other specified data.
    3. Results can be saved in JSON, Markdown, or other formats.
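
A minimal sketch of sending this request from Python with the requests library is shown below. The endpoint path and the API-key header are assumptions made for illustration; check your WaterCrawl instance's API documentation or SDK for the exact values.

import requests

API_BASE = "http://localhost"   # your WaterCrawl instance
API_KEY = "your-api-key"        # hypothetical credential

payload = {
    "url": "https://example.com",
    "pageOptions": {
        "exclude_tags": ["script", "style"],
        "include_tags": ["p", "h1", "h2"],
        "wait_time": 1000,
        "only_main_content": True,
        "include_links": True,
        "timeout": 15000,
    },
    "sync": True,
    "download": True,
}

# The endpoint path below is assumed; adjust it to match your deployment.
response = requests.post(
    f"{API_BASE}/api/v1/core/crawl-requests/",
    json=payload,
    headers={"X-API-Key": API_KEY},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # extracted text, links, or other requested data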

Data Download

WaterCrawl supports saving crawl results to a file. For example, to download a sitemap:

{
  "crawlRequestId": "uuid-of-crawl-request",
  "format": "json"
}
  • Procedure:
    1. Get the ID of the crawl task (via the API or the task list).
    2. Use the API to send a download request, specifying the output format (e.g. JSON, Markdown); a sketch follows this list.
    3. Downloaded files are stored in MinIO and can be retrieved from the MinIO console.
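
Below is a minimal Python sketch of such a download request, again using the requests library. The endpoint path and header are assumptions for illustration; use the actual download URL exposed or returned by your WaterCrawl instance instead.

import requests

API_BASE = "http://localhost"
API_KEY = "your-api-key"                     # hypothetical credential
crawl_request_id = "uuid-of-crawl-request"

# Assumed endpoint path; substitute the real download URL of your instance.
response = requests.post(
    f"{API_BASE}/api/v1/core/crawl-requests/{crawl_request_id}/download/",
    json={"format": "json"},
    headers={"X-API-Key": API_KEY},
    timeout=60,
)
response.raise_for_status()

# Save the downloaded result locally.
with open("crawl-result.json", "wb") as fh:
    fh.write(response.content)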

Real-time Monitoring

WaterCrawl uses Celery to provide task status monitoring. Users can query the progress of a task via the API:

curl https://app.watercrawl.dev/api/tasks/<task_id>

The returned task status includes "in progress", "complete", or "failed", helping users track crawl progress in real time.
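
For longer-running tasks, the status endpoint can be polled from a script. The sketch below assumes the response is JSON with a "status" field; adapt the field name and values to what your instance actually returns.

import time
import requests

task_id = "your-task-id"
url = f"https://app.watercrawl.dev/api/tasks/{task_id}"

while True:
    data = requests.get(url, timeout=10).json()
    status = data.get("status")   # assumed field name
    print("current status:", status)
    if status in ("complete", "failed"):
        break
    time.sleep(5)                 # poll every 5 seconds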

Plug-in Development

WaterCrawl supports plugin extensions that allow developers to create custom functionality based on the provided Python plugin framework:

  1. Install the watercrawl-plugin package:
    pip install watercrawl-plugin
    
  2. Develop plugins using the provided abstract classes and interfaces to define crawling logic or data-processing flows (an illustrative sketch follows this list).
  3. Integrate the plugin into the WaterCrawl main program to run custom tasks.
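
The sketch below only illustrates the general shape a plugin might take; the class and method names are hypothetical and do not come from watercrawl-plugin itself. Use the abstract base classes and hooks documented by the package when writing a real plugin.

# Hypothetical example -- names are illustrative only, not the real base classes.
class MyContentFilterPlugin:
    """Post-processes extracted page content with custom logic."""

    name = "my-content-filter"

    def process_page(self, page: dict) -> dict:
        # Example custom rule: flag pages whose main content is very short.
        text = page.get("content", "")
        page["too_short"] = len(text) < 200
        return page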

Notes

  • If deploying in a production environment, make sure to update the database and MinIO configurations in the .env file, otherwise file uploads or downloads may fail.
  • Avoid discussing security issues publicly on GitHub; send such questions to support@watercrawl.dev.
  • It is recommended to check WaterCrawl's GitHub repository regularly for the latest updates and fixes.

Application Scenarios

  1. Large model data preparation
    Developers need to provide high-quality training data for large language models. WaterCrawl can quickly crawl text content from target websites, clean irrelevant data, and output structured JSON or Markdown files, which are suitable for direct use in model training.
  2. Market Research
    Enterprises need to analyze competitors' website content or industry dynamics. WaterCrawl can batch crawl target web pages and extract key information (e.g., product descriptions, prices, news) to help companies quickly aggregate market data.
  3. Content aggregation
    News or blogging platforms need to collect articles from multiple websites. WaterCrawl supports custom crawling rules to automatically extract article titles, body text, and links, generating a uniformly formatted content library.
  4. SEO Optimization
    SEO experts can use WaterCrawl to crawl sitemaps and page links, analyze site structure and content distribution, and optimize search engine rankings.

FAQ

  1. What programming language SDKs does WaterCrawl support?
    WaterCrawl provides SDKs for Node.js, Go, PHP and Python, covering a wide range of development scenarios. Each SDK supports full API functionality for easy integration.
  2. How do I handle a failed crawl task?
    Check the task status to confirm whether it failed due to a timeout or network issue. Adjust the timeout parameter or check the target site's anti-crawling mechanisms. If necessary, contact support@watercrawl.dev.
  3. Are non-technical users supported?
    WaterCrawl is intended for developers and requires a basic knowledge of Python or Docker. Non-technical users may need assistance from a technical team to deploy and use it.
  4. How can I extend the functionality of WaterCrawl?
    Use the watercrawl-plugin package to develop custom plugins that define specific crawling or data-processing logic, then integrate them into the main program to run.