
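OneAIFW: A Lightweight Open Source Firewall for Protecting Data Privacy with Large Models
OneAIFW (aifw) is an open source tool developed by Funstory.ai to address data privacy problems with large language models (LLMs). In today's LLM applications, users often need to send text containing personally identifiable information (PII) or trade secrets to cloud-hosted models (such as ChatGPT, Claude, and others...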

AutoForm: AI tool that extracts data from any document and automatically fills web forms
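AutoForm is a tool that uses artificial intelligence to free users from repetitive copying and pasting of data. Its core function is to act as an "AI data entry agent" that can read and understand unstructured files in many formats, such as PDF documents, spreadsheets, emails, web pages, and even videos. AutoFo...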

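OCRmyPDF: an open source tool that turns scanned PDFs into searchable text
OCRmyPDF is an open source command-line tool dedicated to adding an optical character recognition (OCR) text layer to scanned PDF files, making them searchable and copyable. It is written in Python and uses the Tesseract OCR engine to accurately recognize the text in images and embed it in the PDF, preserving...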

Docstrange: a tool for extracting data from documents and images and converting them to multiple formats
Docstrange is an open source document processing tool that focuses on extracting data from documents and images in multiple formats and converting them to formats such as Markdown, JSON, CSV, or HTML. It uses artificial intelligence and advanced OCR technology and supports processing PDF, Word documents, Exce...

LangExtract: an open source tool for extracting structured data from text
LangExtract is an open source Python library developed by Google that focuses on extracting structured data from unstructured text. It uses large language models (LLMs) such as the Google Gemini family, combined with accurate source text location and interactive visualization features, to help users quickly process complex text ...

Chat4Data: an AI tool for extracting web data through natural language
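Chat4Data is an AI-based Chrome browser extension that focuses on simplifying web data extraction. It lets users obtain structured data from web pages through natural-language conversation, with no code required. Users only need to describe the data they want in plain language, such as product names, prices, or contact details, and Chat4Dat...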

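ytt-mcp: a server tool for fetching and processing YouTube video subtitles
ytt-mcp is an open source MCP (Model Context Protocol) server tool dedicated to fetching subtitles from YouTube videos and processing them. Developed by the cottongeeks team and hosted on GitHub, it aims to help users quickly extract video subtitles through simple commands or AI tools, and supports further content...

WaterCrawl: transforming web content into data usable for large models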
WaterCrawl is a powerful open source web crawler designed to help users extract data from web pages and transform it into a format suitable for Large Language Model (LLM) processing. It is written in Python, combines Django, Scrapy, and Celery, and supports efficient web crawling and data ...

Dolphin
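Dolphin is an open source document image parsing tool developed by ByteDance that focuses on complex document images, such as scans or PDF files containing text, tables, formulas, and figures. It takes an "analyze first, then parse" approach, using two-stage processing to parse efficiently: it first analyzes the document's page layout, generating...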

OneFileLLM: Integrating Multiple Data Sources into a Single Text File
OneFileLLM is an open source command line tool designed to consolidate multiple data sources into a single text file for easy input into Large Language Models (LLMs). It supports processing GitHub repositories, ArXiv papers, YouTube video transcriptions, web content, Sci-Hub papers and local files, automatically generating the structure...

Chatlog: an open source tool for extracting and querying WeChat chat logs
Chatlog is an open source tool that focuses on extracting and querying chat logs from WeChat's local database. It supports WeChat versions 3.x and 4.0, covering Windows and macOS systems. Users can operate from the command line, terminal interface or HTTP API to view chat logs, contacts, group chats and...

DevDocs: an MCP service for quickly crawling and organizing technical documentation
DevDocs is a completely free and open source tool developed by the CyberAGI team and hosted on GitHub. Designed for programmers and software developers, it starts from the URL of a technical documentation site, automatically crawls the related pages, and organizes them into concise Markdown or JSON files. It has a built-in MCP ...

Free Conversion of Multiple Files to Markdown Format Based on Workers AI
serverless-markdown-convertor is a free open source tool built on Cloudflare Workers and Workers AI that converts a wide range of files to Markdown format. It supports PDF, images, Office documents ...

GPT-Crawler: Automatically Crawling Website Content to Generate Knowledge Base Documents
GPT-Crawler is an open source tool developed by the BuilderIO team and hosted on GitHub. Given one or more website URLs, it crawls the page content and generates a structured knowledge file (output.json) that can be used to create a custom GPT or AI assistant. Users...

pure.md: insert "pure.md/" in front of the URL to extract clean text.
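pure.md is a tool designed for AI agents and developers that focuses on quickly converting web content or files to Markdown. It uses a proxy service to get around anti-scraping restrictions, extracts the core data from web pages, and outputs clean Markdown files. Whether it is dynamic web pages, PDF files, or social media content...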
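
The URL-prefix usage named in the title can be sketched in Python. This is a minimal illustration under assumptions, not the service's documented API: the helper name `to_pure_md` is hypothetical, and it assumes the service is reached over HTTPS at `pure.md` with the original URL appended unchanged.

```python
# Assumption: the service is reachable at this HTTPS prefix.
PURE_MD_PREFIX = "https://pure.md/"


def to_pure_md(url: str) -> str:
    """Return the URL with the pure.md/ prefix inserted in front of it.

    Hypothetical helper for illustration only; it builds a string and
    does not contact the service.
    """
    url = url.strip()
    if not url:
        raise ValueError("url must not be empty")
    # Avoid prefixing a URL that already points at the service.
    if url.startswith(PURE_MD_PREFIX):
        return url
    return PURE_MD_PREFIX + url


if __name__ == "__main__":
    print(to_pure_md("https://example.com/article"))
```

The resulting string could then be passed to any HTTP client; how the service handles authentication or rate limits is not covered here.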

Cloudsquid: upload documents and describe requirements for intelligent extraction of structured data
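Cloudsquid is a company founded in Berlin, Germany, in 2023 that focuses on using artificial intelligence to simplify document processing. Its core product is an online data extraction platform: users simply upload PDFs, images, audio, video, and other files and briefly describe the data they need, such as "find the names and amounts", and the AI will automatically...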

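PDF Craft: an open source tool for converting scanned PDF documents to Markdown
PDF Craft is an open source tool designed for PDFs of scanned books that converts them to Markdown format. Developed by oomol-lab and hosted on GitHub, it suits users who like to organize e-books. The tool runs on local AI models with no internet connection required, which protects privacy and keeps it convenient to use. It...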

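Supametas.AI: Turning Unstructured Data into LLM-Ready Data
Supametas.AI is a data processing platform that specializes in organizing messy information from web pages, documents, audio, and video into structured data that AI can use. It supports collecting data from multiple sources, including web links, APIs, and local files, and outputs it as JSON or Markdown. The platform requires no programming experience, and ordinary...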

MarkPDFDown: converting PDFs to Markdown files with a multimodal model
MarkPDFDown is an open source tool. It utilizes a multimodal large language model to convert PDF files into Markdown format. Developed by GitHub user jorben, this tool has a simple goal: to make PDF documents easier to edit and share. It recognizes headings, lists,...