Document Extraction and Cleaning

AutoForm: AI tool that extracts data from any document and automatically fills web forms
AutoForm is a tool that uses artificial intelligence technology to free users from repetitive data copying and pasting tasks. Its core function is to act as an “AI data entry agent” that can read and understand unstructured files in a variety of formats, such as PDF documents, spreadsheets, emails, web pages and even videos. AutoFo...
08-22 1.6 K0kudos
OCRmyPDF: an open source tool that turns scanned PDFs into searchable text
OCRmyPDF is an open source command line tool that adds an Optical Character Recognition (OCR) text layer to scanned PDF files, turning them into searchable, copyable documents. It is written in Python and uses the Tesseract OCR engine to accurately recognize the text in images and embed it in the PDF to keep ...
08-04 8.0K 0 kudos
Docstrange: a tool for extracting data from documents and images and converting them to multiple formats
Docstrange is an open source document processing tool that focuses on extracting data from documents and images in multiple formats and converting them to formats such as Markdown, JSON, CSV or HTML. It uses artificial intelligence and advanced OCR technology, and supports processing PDF, Word documents, Exce...
08-04 3.7K 0 kudos
LangExtract: an open source tool for extracting structured data from text
LangExtract is an open source Python library developed by Google that focuses on extracting structured data from unstructured text. It uses large language models (LLMs) such as the Google Gemini family, combined with accurate source text location and interactive visualization features, to help users quickly complex text ...
07-31 4.1K 0 kudos
MD-TOOL: Free Markdown Online Conversion Tool
MD-TOOL is a free online toolset site that focuses on conversion services between Markdown format and other file formats. The core features of this site include real-time conversion of Markdown text to HTML code, conversion of HTML code to Markdown text, and conversion of Markdown documents to...
07-28 1.4 K0kudos
OCRFlux: Lightweight tool for converting PDFs and images to Markdown
OCRFlux is an open source, lightweight tool focused on converting PDF files and images to clean Markdown. It is developed by the ChatDOC team, built on a multimodal large model with 3B parameters, and can run on common hardware such as an RTX 3090. The tool specializes in complex document layouts,...
07-22 2.6K 0 kudos
ytt-mcp: server tool to get and process subtitles for YouTube videos
ytt-mcp is an open source MCP (Model Context Protocol) server tool specialized in capturing subtitles from YouTube videos and processing them. Developed by the cottongeeks team and hosted on GitHub, it is designed to help users quickly extract video subtitles with simple commands or AI tools and support further content...
07-22 1.8 K0kudos
WaterCrawl: transforming web content into data usable for large models
WaterCrawl is a powerful open source web crawler tool designed to help users extract data from web pages and transform it into a data format suitable for Large Language Model (LLM) processing. It is written in Python, built on Django, Scrapy and Celery, and supports efficient web crawling and data ...
07-18 2.3K 1 kudos
OneFileLLM: Integrating Multiple Data Sources into a Single Text File
OneFileLLM is an open source command line tool designed to consolidate multiple data sources into a single text file for easy input into Large Language Models (LLMs). It supports processing GitHub repositories, ArXiv papers, YouTube video transcriptions, web content, Sci-Hub papers and local files, automatically generating the structure...
04-18 2.4 K0kudos
Chatlog: an open source tool for extracting and querying WeChat chat logs
Chatlog is an open source tool that focuses on extracting and querying chat logs from WeChat's local database. It supports WeChat versions 3.x and 4.0, covering Windows and macOS systems. Users can operate from the command line, terminal interface or HTTP API to view chat logs, contacts, group chats and...
04-12 1.0 W0kudos
VOP: OCR Tool for Extracting Complex Diagrams and Math Formulas
Versatile OCR Program is an open source Optical Character Recognition (OCR) tool designed specifically for processing complex academic and educational documents. It can extract text, tables, mathematical formulas, diagrams and schematics from PDF, images and other documents and generate structured data suitable for machine learning training. Supports multiple languages, including English...
04-12 2.7 K0kudos
DevDocs: an MCP service for quickly crawling and organizing technical documentation
DevDocs is a completely free and open source tool developed by the CyberAGI team and hosted on GitHub. Designed for programmers and software developers, it starts from the URL of a piece of technical documentation, automatically crawls the relevant pages and organizes them into concise Markdown or JSON files. It has a built-in MCP ...
04-09 2.9K 0 kudos
An open source service that automatically parses PDF content and extracts text and tables
It automatically analyzes the layout of PDF documents, identifies text, titles, images, tables, formulas and other elements on the page, and determines their correct order. The tool supports OCR, so it can convert scanned PDFs to searchable text. It runs on Docker and provides two models: a visual model (Vision Grid Transfor...
04-09 3.2K 0 kudos
Free Conversion of Multiple Files to Markdown Format Based on Workers AI
serverless-markdown-convertor is a free open source tool based on Cloudflare Workers and Workers AI that converts a wide range of files to Markdown format. It supports PDF, images, Office documents ...
03-30 2.6K 0 kudos
GPT-Crawler: Automatically Crawling Website Content to Generate Knowledge Base Documents
GPT-Crawler is an open source tool developed by the BuilderIO team and hosted on GitHub. Given one or more website URLs, it crawls the page content and generates a structured knowledge file (output.json) that can be used to create a custom GPT or AI assistant. Users...
03-29 3.7K 0 kudos
pure.md: insert "pure.md/" in front of the URL to extract clean text.
pure.md is a tool designed for AI agents and developers that focuses on quickly converting web content or files to Markdown format. It bypasses anti-crawler restrictions through proxy services, extracts the core data of a web page, and outputs a clean Markdown file. Whether it's a dynamic web page, PDF file or social media content...
03-25 2.7 K0kudos
Cloudsquid: upload documents and describe requirements for intelligent extraction of structured data
Cloudsquid is a company founded in 2023 in Berlin, Germany, that specializes in using artificial intelligence to simplify document processing. Its core product is an online data extraction platform that allows users to upload PDFs, images, audio, video, etc., and simply state what data needs to be extracted, such as “find out the name and amount”, and the AI will automatically finish...
03-25 2.3 K0kudos
PDF Craft: an open source tool for converting scanned PDFs to Markdown
PDF Craft is an open source tool designed for converting scanned PDFs of books to Markdown format. It is developed by oomol-lab and hosted on GitHub, and is aimed at users who like to organize their eBooks. The tool runs on a local AI model and does not require an internet connection, which protects privacy and simplifies operation. It...
03-24 3.7K 0 kudos
Supametas.AI: Turning Unstructured Data into LLM-Ready Data
Supametas.AI is a data processing platform that specializes in organizing web pages, documents, audio, video and other unstructured content into structured data that AI can use. It supports collecting data from multiple sources, including web links, APIs, local files, etc., and then outputs it in JSON or Markdown format. The platform requires no programming experience, ordinary...
03-24 2.6K 0 kudos