
WaterCrawl is a powerful open-source web crawler designed to help users extract data from web pages and transform it into formats suitable for Large Language Model (LLM) processing. It is built in Python on top of Django, Scrapy, and Celery, which together support efficient web crawling and data processing. WaterCrawl provides SDKs in multiple languages, including Node.js, Go, and PHP, so developers can easily integrate it into different projects. It can be deployed quickly via Docker or customized through the provided APIs. It is designed to simplify web data extraction for developers and organizations that work with large amounts of web content.

Function List

  • Efficient Web Crawling: Supports custom crawl depth, speed, and target content so web data can be fetched quickly.
  • Data Extraction and Cleaning: Automatically filters out irrelevant tags (e.g. scripts, styles), extracts the main content, and supports multiple output formats (e.g. JSON, Markdown).
  • Multi-language SDK Support: Provides SDKs for Node.js, Go, PHP, and Python to meet different development needs.
  • Real-time Progress Monitoring: Provides real-time status updates for crawl tasks via Celery.
  • Docker Deployment Support: Quickly builds local or production environments with Docker Compose.
  • MinIO Integration: Supports file storage and download, suitable for handling large-scale data.
  • Plugin Extensions: Provides a plugin framework that allows developers to customize crawling and processing logic.
  • API Integration: Supports managing crawl tasks, getting results, or downloading data via the API.

Using Help

WaterCrawl's installation and use are aimed at developers and technical teams, and are best suited to users familiar with Python and Docker. The detailed installation and usage process is described below.

Installation process

  1. Clone the repository
    First, clone WaterCrawl's GitHub repository to your local machine:

    git clone https://github.com/watercrawl/watercrawl.git
    cd watercrawl
    

This will download the WaterCrawl source code to your local environment.

  2. Installing Docker
    Make sure Docker and Docker Compose are installed on your system. If not, download and install them from the Docker official website.
  3. Configuring Environment Variables
    Go into the docker directory and copy the sample environment configuration file:

    cd docker
    cp .env.example .env
    

    Edit the .env file to configure the necessary parameters, such as the database connection and the MinIO storage address. If deploying in a non-local environment, such as a cloud server, the MinIO configuration must be updated:

    MINIO_EXTERNAL_ENDPOINT=your-domain.com
    MINIO_BROWSER_REDIRECT_URL=http://your-domain.com/minio-console/
    MINIO_SERVER_URL=http://your-domain.com/
    

    These settings ensure that the file upload and download functions work properly. More configuration details can be found in the DEPLOYMENT.md documentation.

  4. Starting the Docker Container
    Start the service using Docker Compose:

    docker compose up -d
    

    This will start WaterCrawl's core services, including the Django backend, Scrapy crawler, and MinIO storage.

  5. Accessing the Application
    Open your browser and visit http://localhost to check that the service is running correctly. If deployed on a remote server, replace localhost with your domain name or IP address. A small script-based reachability check is sketched below.
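
If you prefer to check from a script rather than a browser, the following is a minimal reachability check. This is a convenience sketch using the Python requests library, not part of WaterCrawl itself; replace the URL with your domain when deploying remotely.

import requests

# A 200 response means the web interface is being served.
resp = requests.get("http://localhost", timeout=10)
print(resp.status_code)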

Functional operation flow

WaterCrawl's core functions are web crawling and data extraction. The detailed steps for the main functions are described below.

Web Crawling

WaterCrawl uses the Scrapy framework for web crawling, and users can start crawl tasks via the API or the command line. For example, a crawl task can be initiated with a request like the following:

{
  "url": "https://example.com",
  "pageOptions": {
    "exclude_tags": ["script", "style"],
    "include_tags": ["p", "h1", "h2"],
    "wait_time": 1000,
    "only_main_content": true,
    "include_links": true,
    "timeout": 15000
  },
  "sync": true,
  "download": true
}
  • Parameter description:
    • exclude_tags: Filters out the specified HTML tags (such as scripts and styles).
    • include_tags: Extracts only the contents of the specified tags.
    • only_main_content: Extracts only the main content of the page, ignoring sidebars, ads, etc.
    • include_links: Whether to include the links found on the page.
    • timeout: Sets the crawl timeout in milliseconds.
  • Procedure:
    1. Send the above JSON request using an SDK (such as Python or Node.js) or directly through the API, as in the sketch after this list.
    2. WaterCrawl returns the crawl results with the extracted text, links, or other specified data.
    3. Results can be saved in JSON, Markdown, or other formats.
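
A minimal sketch of sending this request from Python with the requests library is shown below. The endpoint path and the API-key header are assumptions made for illustration; check your WaterCrawl instance's API documentation or SDK for the exact values.

import requests

API_BASE = "http://localhost"   # your WaterCrawl instance
API_KEY = "your-api-key"        # hypothetical credential

payload = {
    "url": "https://example.com",
    "pageOptions": {
        "exclude_tags": ["script", "style"],
        "include_tags": ["p", "h1", "h2"],
        "wait_time": 1000,
        "only_main_content": True,
        "include_links": True,
        "timeout": 15000,
    },
    "sync": True,
    "download": True,
}

# The endpoint path below is assumed; adjust it to match your deployment.
response = requests.post(
    f"{API_BASE}/api/v1/core/crawl-requests/",
    json=payload,
    headers={"X-API-Key": API_KEY},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # extracted text, links, or other requested data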

Data Download

WaterCrawl supports saving crawl results to a file. For example, to download a sitemap:

{
  "crawlRequestId": "uuid-of-crawl-request",
  "format": "json"
}
  • Procedure:
    1. Get the ID of the crawl task (via the API or the task list).
    2. Use the API to send a download request, specifying the output format (e.g. JSON, Markdown); a sketch follows this list.
    3. Downloaded files are stored in MinIO and can be retrieved from the MinIO console.
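
Below is a minimal Python sketch of such a download request, again using the requests library. The endpoint path and header are assumptions for illustration; use the actual download URL exposed or returned by your WaterCrawl instance instead.

import requests

API_BASE = "http://localhost"
API_KEY = "your-api-key"                     # hypothetical credential
crawl_request_id = "uuid-of-crawl-request"

# Assumed endpoint path; substitute the real download URL of your instance.
response = requests.post(
    f"{API_BASE}/api/v1/core/crawl-requests/{crawl_request_id}/download/",
    json={"format": "json"},
    headers={"X-API-Key": API_KEY},
    timeout=60,
)
response.raise_for_status()

# Save the downloaded result locally.
with open("crawl-result.json", "wb") as fh:
    fh.write(response.content)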

Real-time Monitoring

WaterCrawl uses Celery to provide task status monitoring. Users can query the progress of a task via the API:

curl https://app.watercrawl.dev/api/tasks/<task_id>

The returned task status includes "in progress", "complete", or "failed", helping users track crawl progress in real time.
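
For longer-running tasks, the status endpoint can be polled from a script. The sketch below assumes the response is JSON with a "status" field; adapt the field name and values to what your instance actually returns.

import time
import requests

task_id = "your-task-id"
url = f"https://app.watercrawl.dev/api/tasks/{task_id}"

while True:
    data = requests.get(url, timeout=10).json()
    status = data.get("status")   # assumed field name
    print("current status:", status)
    if status in ("complete", "failed"):
        break
    time.sleep(5)                 # poll every 5 seconds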

Plug-in Development

WaterCrawl supports plugin extensions that allow developers to create custom functionality based on the provided Python plugin framework:

  1. Install the watercrawl-plugin package:
    pip install watercrawl-plugin
    
  2. Develop plugins using the provided abstract classes and interfaces to define crawling logic or data-processing flows (an illustrative sketch follows this list).
  3. Integrate the plugin into the WaterCrawl main program to run custom tasks.
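
The sketch below only illustrates the general shape a plugin might take; the class and method names are hypothetical and do not come from watercrawl-plugin itself. Use the abstract base classes and hooks documented by the package when writing a real plugin.

# Hypothetical example -- names are illustrative only, not the real base classes.
class MyContentFilterPlugin:
    """Post-processes extracted page content with custom logic."""

    name = "my-content-filter"

    def process_page(self, page: dict) -> dict:
        # Example custom rule: flag pages whose main content is very short.
        text = page.get("content", "")
        page["too_short"] = len(text) < 200
        return page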

Notes

  • If deploying in a production environment, make sure to update the database and MinIO configurations in the .env file, otherwise file uploads or downloads may fail.
  • Avoid discussing security issues publicly on GitHub; send such questions to support@watercrawl.dev.
  • It is recommended to check WaterCrawl's GitHub repository regularly for the latest updates and fixes.

Application Scenarios

  1. Large model data preparation
    Developers need to provide high-quality training data for large language models. WaterCrawl can quickly crawl text content from target websites, clean irrelevant data, and output structured JSON or Markdown files, which are suitable for direct use in model training.
  2. Market Research
    Enterprises need to analyze competitors' website content or industry dynamics. WaterCrawl can batch crawl target web pages and extract key information (e.g., product descriptions, prices, news) to help companies quickly aggregate market data.
  3. Content aggregation
    News or blogging platforms need to collect articles from multiple websites. WaterCrawl supports custom crawling rules to automatically extract article titles, body text, and links, generating a uniformly formatted content library.
  4. SEO Optimization
    SEO experts can use WaterCrawl to crawl sitemaps and page links, analyze site structure and content distribution, and optimize search engine rankings.

FAQ

  1. What programming language SDKs does WaterCrawl support?
    WaterCrawl provides SDKs for Node.js, Go, PHP and Python, covering a wide range of development scenarios. Each SDK supports full API functionality for easy integration.
  2. How do I handle a failed crawl task?
    Check the task status to confirm whether it failed due to a timeout or network issue. Adjust the timeout parameter or check the target site's anti-crawling mechanisms. If necessary, contact support@watercrawl.dev.
  3. Are non-technical users supported?
    WaterCrawl is intended for developers and requires a basic knowledge of Python or Docker. Non-technical users may need assistance from a technical team to deploy and use it.
  4. How can I extend the functionality of WaterCrawl?
    Use the watercrawl-plugin package to develop custom plugins that define specific crawling or data-processing logic, then integrate them into the main program to run.