
SmolDocling: a compact visual language model for efficient document processing
SmolDocling is a visual language model (VLM) developed by the ds4sd team in collaboration with IBM and hosted on the Hugging Face platform. It is based on SmolVLM-256M, which is the world's smallest VLM with only 256M parameters.

PaddlePaddle PP-TableMagic: Structured Information Extraction for Complex Tables
The goal of table recognition is to parse tables in images, accurately identify table structure and cell locations, and restore them to a structured table format (e.g. HTML). In today's information age, a large amount of important tabular data still exists in unstructured form (e.g., images of statistical tables in scanned documents, statistical tables in PDF financial reports, etc.), which cannot be...

Mistral OCR: 94.89% Overall Accuracy, 1000 Pages/30 Seconds, Only $1
In the long history of human civilization, every leap in the way information is acquired and analyzed has profoundly contributed to social progress. From the ancient hieroglyphics, to the portable papyrus, to the later emergence of the printing press and today's wave of digitalization, each technological innovation has greatly expanded the scope of dissemination and depth of application of human knowledge, which in turn has become a breeding ground for a new round of innovation...

Firecrawl MCP Server: Firecrawl-based Web Crawler MCP Service
Firecrawl MCP Server is an open source tool developed by MendableAI. It implements the Model Context Protocol (MCP) and integrates with the Firecrawl API to provide powerful web crawling and data extraction. It specializes in ...

par_scrape: a crawler tool to intelligently extract data from web pages
par_scrape is a Python-based open source web crawler tool, launched on GitHub by developer Paul Robello, designed to help users intelligently extract data from web pages. It integrates Selenium and Playwright, two powerful browser automation...

PDF-Extract-Kit: an open source tool for extracting complex structured content from PDFs
PDF-Extract-Kit is an open source project developed by the OpenDataLab team, focusing on efficient extraction of high-quality content from complex and diverse PDF documents. It integrates advanced document parsing technology and supports layout detection, formula recognition, table extraction, OCR and other functions, suitable for academic papers, research ...

Crawl4LLM: An Efficient Web Crawling Tool for LLM Pretraining
Crawl4LLM is an open source project jointly developed by Tsinghua University and Carnegie Mellon University that focuses on making web crawling more efficient for pre-training large language models (LLMs). By intelligently selecting high-quality web pages, it significantly reduces wasted crawling: the project claims that where 100 web pages would otherwise need to be crawled, only 21 are required, while maintaining the pre-training effect...

Markdownify MCP Server: Converts various content to Markdown format based on the MCP protocol.
Markdownify MCP Server is an open source tool based on the Model Context Protocol, hosted on GitHub and created by developer Zach Caceres. It focuses on the multiple file types (e.g. PDF, images, audio...

CodeWeaver: Automatically generate Markdown documents from code structure and content.
CodeWeaver is a command-line tool designed to weave a code base into a single, easy-to-navigate Markdown document. It generates a structured representation of a project's file hierarchy by recursively scanning through directories and embedding the contents of each file in code blocks. The tool is designed with the goal of simplifying codebase sharing and information extraction, and is particularly suited to...

Kreuzberg: open source tool to extract text from any document
Kreuzberg is a library for simplifying text extraction from PDF files, designed to provide a simple, hassle-free text extraction solution. The library is especially suited for RAG (Retrieval-Augmented Generation) services that require text extraction. Kreuzberg supports local operation, easy control...

Instructor: a Python library to simplify structured output workflows for large language models
Instructor is a popular Python library designed for processing structured output from large language models (LLMs). Built on Pydantic, it provides a simple, transparent, and user-friendly API for managing data validation, retries, and streaming responses. Instructor Monthly Under...

zChunk: a generic semantic chunking strategy based on Llama-70B
zChunk is a novel chunking strategy developed by ZeroEntropy that aims to provide a solution for generic semantic chunking. The strategy is based on the Llama-70B model and optimizes the chunking process of a document by prompting for chunk generation, ensuring that a high signal-to-noise ratio is maintained during information retrieval. zChunk is particularly suited for RAs requiring high-precision retrieval ...

Pulse: Business Solutions for Document Processing and Data Extraction
Pulse is an intelligent platform focused on document processing and data extraction, designed to help organizations and developers efficiently parse and process a wide range of complex documents. Through its advanced computer vision and multimodal processing technologies, Pulse is able to accurately extract structured data from documents in a variety of formats, including text, images, tables, and more. The platform supports a wide range of industry applications...

Rowfill: Batch Extraction of Structured Information from Documents and Automated Analysis
Rowfill is an open source document processing platform designed for knowledge workers. It leverages advanced artificial intelligence technologies to extract, analyze and process data from complex documents, images and PDFs. Rowfill supports native Large Language Models (LLM) and OpenAI visual models to ensure data privacy and security. The platform provides high...

PPTX2MD: Specialized tool for converting PPTX files to Markdown
PPTX2MD is an open source tool designed to convert PowerPoint PPTX files to Markdown format. Developed by GitHub user ssine, the tool supports retaining headings, lists, text formatting (such as bold, italic, color, and hyperlinks), images, and tables, among other formats. PPTX2MD also supports...

Repomix: packaging the code base into a text file for large model retrieval
Repomix (formerly known as Repopack) is an open source tool designed to package an entire codebase into a single, AI-friendly file. This tool allows developers to easily make their codebase available for analysis and processing by large language models such as Claude, ChatGPT, and Gemini. It was originally designed to ...

Yek: reading git repository text files and quickly chunking them for use in large models
Yek is a fast Rust-based tool for reading text files from a repository or directory, chunking them, and serializing them for use in large language models (LLMs). The tool uses .gitignore rules by default to skip unneeded files and uses Git history to infer important files. Yek can read and serialize files based on an approximation of the “...

LlamaParse: High-quality document parsing and data extraction service by LlamaIndex (1000 free pages per day).
LlamaParse is a powerful document parsing tool that can work with complex documents such as PDF, PowerPoint, Word documents and spreadsheets and convert them into structured data. LlamaParse offers multiple ways to use it, including a standalone REST API, Python packages, t...

UnDatas.IO: API service for accurate parsing of various types of unstructured data (paid)
UnDatas.IO is a platform focused on parsing and processing unstructured data. It utilizes advanced technology to automatically recognize document layouts and categorize tables, images, formulas and text, greatly simplifying data processing. The platform not only saves a lot of time in organizing data, but also helps users extract valuable insights from data to make more war...