
OCRmyPDF: An Open Source Tool for Turning Scanned PDFs into Searchable Text

2025-08-04


OCRmyPDF is an open source command-line tool that adds an Optical Character Recognition (OCR) text layer to scanned PDF files, turning them into documents whose text can be searched and copied. It is written in Python and uses the Tesseract OCR engine to recognize the text in page images and embed it in the PDF while preserving the original layout and image quality. The tool supports multiple languages, runs on Linux, Windows, macOS and other platforms, and also provides a Docker image for cross-platform deployment. By default OCRmyPDF produces PDF/A output, which is suited to long-term archiving, and it also supports page correction, image optimization and other features. It is widely used for document digitization and archiving.

Function List

Add searchable OCR text layers to scanned PDFs with copy and paste support.
Generates PDF/A output by default, suitable for long-term document archiving.
Supports text recognition in more than 100 languages through Tesseract language packs, covering English, German, Chinese and more.
Automatic correction of page skew (deskew) and rotation (rotate-pages).
Optimize PDF file size, often generating smaller output than the input file.
Supports multi-core parallel processing to enhance the efficiency of large-scale document processing.
Provides debug mode for easy verification of OCR results.
Supports extension through plug-ins and handles complex PDF structures.
Attempts to repair damaged PDF files automatically for better compatibility.

Usage Guide

Installation process

Installing OCRmyPDF requires several dependencies, including Python, Tesseract and Ghostscript. Detailed installation steps for common operating systems follow.

Linux (Ubuntu 22.04 as an example)

Make sure Python 3 and pip are installed on your system:
```
python3 --version
pip3 --version
```

Install dependencies:

```
sudo apt update
sudo apt install tesseract-ocr ghostscript python3-pip pngquant
```

Install OCRmyPDF using pip:
```
pip3 install ocrmypdf
```
Verify the installation:
```
ocrmypdf --version
```
If the version number is displayed, the installation was successful.
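
Before running OCRmyPDF for the first time, it can help to confirm that each dependency is actually on the PATH. The following is a minimal sketch, not part of OCRmyPDF itself; the function name `check_commands` is a hypothetical helper introduced here for illustration.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per absent command and
# returns non-zero if at least one is missing.
check_commands() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            rc=1
        fi
    done
    return "$rc"
}

# Example: check the tools installed in the steps above
# (gs is the Ghostscript executable on Linux).
# check_commands python3 tesseract gs pngquant ocrmypdf
```

If the function prints nothing and returns 0, all listed commands were found.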

Windows

Install Python 3 (we recommend downloading the latest version via the official website).
Install Tesseract and Ghostscript (Chocolatey package manager is recommended):
```
choco install tesseract ghostscript
```
Install OCRmyPDF using pip:
```
pip install ocrmypdf
```
Confirm that the installation is complete:
```
ocrmypdf --version
```

macOS (using Homebrew)

Install Homebrew (if not already installed):

```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

Install OCRmyPDF together with its dependencies:
```
brew install tesseract ghostscript ocrmypdf
```

Verify the installation:
```
ocrmypdf --version
```

Docker Installation

Make sure Docker is installed and running:
```
docker run hello-world
```
Pull the OCRmyPDF image:
```
docker pull jbarlow83/ocrmypdf
```
Tag the image with a convenient name:
```
docker tag jbarlow83/ocrmypdf ocrmypdf
```

Usage

OCRmyPDF is a command-line tool that is simple to use yet powerful. The basic command format is:

```
ocrmypdf [options] input_file output_file
```

Basic Operations

Simple OCR conversion: Convert a scanned PDF to a searchable PDF:
```
ocrmypdf input.pdf output.pdf
```
This adds an OCR text layer to input.pdf and writes the result to output.pdf.
Specify language: OCRmyPDF supports multi-language OCR, e.g. for PDFs containing both English and Chinese:
```
ocrmypdf -l eng+chi_sim input.pdf output.pdf
```
Language codes are listed in the Tesseract documentation.
Page correction and optimization: Automatically correct page skew and generate PDF/A:
```
ocrmypdf --deskew --output-type pdfa input.pdf output.pdf
```
Parallel processing: Use multiple CPU cores to speed up processing:
```
ocrmypdf --jobs 4 input.pdf output.pdf
```
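
A language passed with `-l` only works if the matching Tesseract language data is installed. The sketch below checks a language specification such as `eng+chi_sim` against the output of `tesseract --list-langs`. The function name `check_ocr_languages` is a hypothetical helper, and the sketch assumes the usual output format of a header line followed by one language code per line.

```shell
# Check that every language in a spec such as "eng+chi_sim"
# is known to Tesseract. Prints each missing language and
# returns non-zero if any is missing.
check_ocr_languages() {
    spec=$1
    # Capture stderr too, since some older Tesseract
    # versions print the language list there.
    installed=$(tesseract --list-langs 2>&1)
    rc=0
    # Split the spec on "+" into separate language codes.
    for lang in $(printf '%s' "$spec" | tr '+' ' '); do
        # -F: fixed string, -x: whole line, -q: quiet
        if ! printf '%s\n' "$installed" | grep -Fqx -- "$lang"; then
            echo "language not installed: $lang"
            rc=1
        fi
    done
    return "$rc"
}

# Example:
# check_ocr_languages eng+chi_sim && ocrmypdf -l eng+chi_sim input.pdf output.pdf
```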

Featured Operations

Page rotation: Automatically detects and fixes page orientation:
```
ocrmypdf --rotate-pages input.pdf output.pdf
```
Pass --rotate-pages-threshold to set the confidence threshold for rotation.
Image cleanup: Clean up page images before OCR to improve recognition accuracy (this option requires the unpaper program to be installed):
```
ocrmypdf --clean input.pdf output.pdf
```
Debug mode: Generate detailed logs to help verify OCR results:
```
ocrmypdf --verbose 2 input.pdf output.pdf
```
Skip existing text: Skip pages that already contain text so they are not processed again:
```
ocrmypdf --skip-text input.pdf output.pdf
```

Docker Usage

Use Docker to run OCRmyPDF when no local installation is available:

```
docker run --rm -v "$(pwd):/data" ocrmypdf /data/input.pdf /data/output.pdf
```

This command processes input.pdf in the current directory and writes the result to output.pdf in the same directory.
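
Typing the full `docker run` line each time is tedious, so a small shell function can wrap it. This is a sketch that assumes the image was tagged `ocrmypdf` as in the Docker installation step; the function name `docker_ocrmypdf` is chosen here for illustration. The `--user` option makes the container write output files as the current user rather than root.

```shell
# Run the Dockerized OCRmyPDF on files in the current directory.
# All arguments are passed through to ocrmypdf in the container.
docker_ocrmypdf() {
    docker run --rm \
        --user "$(id -u):$(id -g)" \
        --workdir /data \
        -v "$PWD:/data" \
        ocrmypdf "$@"
}

# Example (relative paths resolve inside /data, which is
# the current directory on the host):
# docker_ocrmypdf -l eng --deskew input.pdf output.pdf
```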

Notes

Make sure the input PDF is a scanned document. PDFs that already contain text may need to be processed with --skip-text.
Tesseract language packs need to be installed separately to support multiple languages, for example:
```
sudo apt install tesseract-ocr-chi-sim
```
For complex PDFs, enable --verbose to view detailed logs for troubleshooting.
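
When it is not known in advance whether a PDF already contains text, the first note above can be automated. The sketch below relies on OCRmyPDF's documented exit code 6, which signals that the input already has text, and retries with `--skip-text` in that case. The function name `ocr_with_fallback` is hypothetical, and the exit code should be checked against the installed version's documentation.

```shell
# Run ocrmypdf; if it stops because the input already
# contains text (exit code 6), retry with --skip-text so
# that only pages without text are OCRed.
ocr_with_fallback() {
    rc=0
    ocrmypdf "$@" || rc=$?
    if [ "$rc" -eq 6 ]; then
        echo "existing text found, retrying with --skip-text" >&2
        rc=0
        ocrmypdf --skip-text "$@" || rc=$?
    fi
    return "$rc"
}

# Example:
# ocr_with_fallback --deskew input.pdf output.pdf
```

Any other non-zero exit code is returned unchanged so that the caller can handle it.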

Application Scenarios

Document Digitization
After scanning a paper document to PDF, use OCRmyPDF to add a text layer so the content can be searched and copied. This suits file management and legal document archiving.
Academic Research
Researchers can convert scanned academic papers into searchable PDFs, making it easier to extract citations or keywords and to manage literature.
Corporate Archiving
Businesses can batch process scanned contracts and invoices into PDF/A format for long-term retention and legal compliance.
Multilingual Document Processing
For multilingual scanned documents, such as contracts in mixed Chinese and English, OCRmyPDF recognizes several languages in one pass and embeds the text.
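
The batch processing mentioned under Corporate Archiving can be sketched as a simple loop over a directory. The function name `ocr_batch` and the directory names in the example are assumptions made for illustration; the `ocrmypdf` options used are the ones shown earlier in this article.

```shell
# OCR every *.pdf in SRC_DIR into DEST_DIR, keeping file names.
# Extra arguments are passed to ocrmypdf unchanged.
# Returns non-zero if any file failed.
ocr_batch() {
    src_dir=$1
    dest_dir=$2
    shift 2
    mkdir -p "$dest_dir"
    failed=0
    for pdf in "$src_dir"/*.pdf; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$pdf" ] || continue
        name=$(basename "$pdf")
        if ocrmypdf "$@" "$pdf" "$dest_dir/$name"; then
            echo "ok: $name"
        else
            echo "failed: $name"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Example: archive scanned contracts as PDF/A.
# ocr_batch ./scans ./archive --output-type pdfa -l eng+chi_sim
```

Processing one file at a time keeps a failure in one document from stopping the rest of the batch.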

FAQ

What operating systems does OCRmyPDF support?
OCRmyPDF supports Linux, Windows, macOS and FreeBSD, and can also be run on any platform through Docker.
How do I handle non-English documents?
Use -l to specify the language code, e.g. -l chi_sim for Simplified Chinese. The corresponding Tesseract language pack must be installed.
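What if the output file is larger than the input?
Raise the optimization level with --optimize (level 1 is the default; levels 2 and 3 apply stronger, potentially lossy compression), or install a JBIG2 encoder to reduce the file size.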
How do I verify OCR results?
Use --verbose 2 to generate detailed logs, or open the output PDF and check that its text can be selected and copied.
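
Checking for copyable text can also be done from the command line. The sketch below assumes that `pdftotext` from Poppler is installed, which is a separate tool and not part of OCRmyPDF. The function name `pdf_word_count` is hypothetical.

```shell
# Count the words that pdftotext can extract from a PDF.
# A result of 0 suggests the file has no usable text layer.
pdf_word_count() {
    # "-" sends the extracted text to standard output;
    # tr strips the padding some wc versions add.
    pdftotext "$1" - | wc -w | tr -d ' '
}

# Example: compare the file before and after OCR.
# pdf_word_count input.pdf
# pdf_word_count output.pdf
```

A scanned input typically yields 0 words, while a successfully processed output yields a count close to the number of words on the pages.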
