OCRmyPDF is an open-source command-line tool that adds an Optical Character Recognition (OCR) text layer to scanned PDF files, turning them into searchable documents whose text can be selected and copied. It is written in Python and uses the Tesseract OCR engine to recognize the text in page images and embed it in the PDF while preserving the original layout and image quality. The tool supports multiple languages, runs on Linux, Windows, macOS, and other platforms, and also provides a Docker image for cross-platform deployment. By default OCRmyPDF generates PDF/A output, which is suitable for long-term archiving. It also supports page correction, image optimization, and other features, and is widely used for document digitization and archiving.
Features
- Add searchable OCR text layers to scanned PDFs with copy and paste support.
- Generates PDF/A output by default, suitable for long-term document archiving.
- Supports text recognition in the more than 100 languages covered by Tesseract, including English, German, and Chinese.
- Automatic correction of page skew (deskew) and rotation (rotate-pages).
- Optimize PDF file size, often generating smaller output than the input file.
- Supports multi-core parallel processing to enhance the efficiency of large-scale document processing.
- Provides debug mode for easy verification of OCR results.
- Supports extension through plug-ins and handles complex PDF structures.
- Automatically repair corrupted PDF files for enhanced compatibility.
Usage Guide
Installation
Installing OCRmyPDF requires setting up its dependencies, including Python, Tesseract, and Ghostscript, on a supported operating system. Detailed installation steps for common operating systems follow:
Linux (Ubuntu 22.04 as an example)
- Make sure Python 3 and pip are installed on your system:
python3 --version
pip3 --version
- Install dependencies:
sudo apt update
sudo apt install tesseract-ocr ghostscript python3-pip pngquant
- Install OCRmyPDF using pip:
pip3 install ocrmypdf
- Verify the installation:
ocrmypdf --version
If the version number is displayed, the installation was successful.
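As a complement to the version check, the following minimal Python sketch reports which of the required executables cannot be found on PATH. The tool list is an assumption based on the steps above (Ghostscript is assumed to be installed as `gs`, as it is on Linux), and `missing_tools` is a hypothetical helper name, not part of OCRmyPDF.

```python
import shutil

# Executables the installation steps above are expected to put on PATH.
# Assumption: Ghostscript is exposed as "gs" (true on Linux and macOS).
REQUIRED_TOOLS = ("tesseract", "gs", "ocrmypdf")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

An empty result only shows that the programs exist; running `ocrmypdf --version` remains the actual installation test.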
Windows
- Install Python 3 (we recommend downloading the latest version via the official website).
- Install Tesseract and Ghostscript (Chocolatey package manager is recommended):
choco install tesseract ghostscript
- Install OCRmyPDF using pip:
pip install ocrmypdf
- Confirm that the installation is complete:
ocrmypdf --version
macOS (using Homebrew)
- Install Homebrew (if not already installed):
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Install OCRmyPDF and its dependencies:
brew install tesseract ghostscript ocrmypdf
- Verify the installation:
ocrmypdf --version
Docker Installation
- Make sure Docker is installed and running:
docker run hello-world
- Pull the OCRmyPDF image:
docker pull jbarlow83/ocrmypdf
- Tag the image with a shorter name for convenience:
docker tag jbarlow83/ocrmypdf ocrmypdf
Usage
OCRmyPDF is a command-line tool that is simple to use yet powerful. The basic command format is:
ocrmypdf [options] input_file output_file
Basic Operations
- Simple OCR conversion:
Convert a scanned PDF into a searchable PDF:
ocrmypdf input.pdf output.pdf
This processes input.pdf and writes output.pdf with an OCR text layer.
- Specify language:
Multiple languages can be recognized in one run, e.g. for PDFs containing both English and Chinese:
ocrmypdf -l eng+chi_sim input.pdf output.pdf
Language codes can be found in the Tesseract documentation.
- Page correction and optimization:
Automatic skew correction and PDF/A generation:
ocrmypdf --deskew --output-type pdfa input.pdf output.pdf
- Parallel processing:
Use multiple cores to speed up processing:
ocrmypdf --jobs 4 input.pdf output.pdf
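When these options are combined from a script, it can help to assemble the command as an argument list rather than a shell string. The sketch below is one possible way to do that in Python. `build_ocrmypdf_command` and its parameter names are hypothetical; only the flags themselves (`-l`, `--deskew`, `--output-type pdfa`, `--jobs`) come from the commands above.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, languages=None,
                           deskew=False, pdfa=False, jobs=None):
    """Build an ocrmypdf argument list from the basic options shown above."""
    cmd = ["ocrmypdf"]
    if languages:
        # Tesseract language codes are joined with "+", e.g. eng+chi_sim.
        cmd += ["-l", "+".join(languages)]
    if deskew:
        cmd.append("--deskew")
    if pdfa:
        cmd += ["--output-type", "pdfa"]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    # Input and output paths always come last.
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    cmd = build_ocrmypdf_command("input.pdf", "output.pdf",
                                 languages=["eng", "chi_sim"],
                                 deskew=True, pdfa=True, jobs=4)
    # Print the command for inspection; it is not executed here.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file names that contain spaces.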
Advanced Features
- Page rotation: Automatically detects and fixes page orientation:
ocrmypdf --rotate-pages input.pdf output.pdf
Pass --rotate-pages-threshold to set the confidence threshold for rotation.
- Image cleanup: Clean up images before OCR to improve recognition accuracy (requires the unpaper utility):
ocrmypdf --clean input.pdf output.pdf
- Debug mode: Generate detailed logs to help verify OCR results:
ocrmypdf --verbose 2 input.pdf output.pdf
- Skip existing text: Skip pages that already contain text so they are not OCRed a second time:
ocrmypdf --skip-text input.pdf output.pdf
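Verbose logs are easier to review when saved to a file. The sketch below runs a command and stores its stderr output on disk, under the assumption that OCRmyPDF writes its log messages to stderr. `run_and_save_log` is a hypothetical helper, and the log file name in the usage comment is arbitrary.

```python
import subprocess
import sys
from pathlib import Path


def run_and_save_log(cmd, log_path):
    """Run `cmd`, write its stderr output to `log_path`, return the exit code."""
    # Assumption: the tool being run emits its log messages on stderr.
    result = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
    Path(log_path).write_text(result.stderr, encoding="utf-8")
    return result.returncode


# Example call (requires ocrmypdf to be installed):
# code = run_and_save_log(
#     ["ocrmypdf", "--verbose", "2", "input.pdf", "output.pdf"],
#     "ocrmypdf.log",
# )
# if code != 0:
#     print("ocrmypdf failed, see ocrmypdf.log", file=sys.stderr)
```

If the executable is not installed, `subprocess.run` raises `FileNotFoundError` instead of returning an exit code.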
Docker Usage
Use Docker to run OCRmyPDF when no local installation is available:
docker run --rm -v "$(pwd):/data" ocrmypdf /data/input.pdf /data/output.pdf
This command processes input.pdf in the current directory and writes the result to output.pdf in the same directory.
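The Docker invocation can also be generated from a script. The following sketch builds the same `docker run` argument list for an arbitrary working directory. `build_docker_command` is a hypothetical helper, and the default image name `ocrmypdf` assumes the image was tagged as described in the Docker installation steps.

```python
from pathlib import Path


def build_docker_command(work_dir, input_name, output_name, image="ocrmypdf"):
    """Build a docker run argument list that mounts work_dir at /data."""
    # Docker needs an absolute host path for a bind mount.
    host_dir = Path(work_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/data",
        image,
        f"/data/{input_name}",
        f"/data/{output_name}",
    ]


if __name__ == "__main__":
    # Print the command for the current directory; it is not executed here.
    print(build_docker_command(".", "input.pdf", "output.pdf"))
```

Passing the list to `subprocess.run` avoids shell quoting, so directories with spaces in their names need no special handling.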
Notes
- OCRmyPDF is intended for scanned documents; PDFs that already contain text may need the --skip-text option.
- Tesseract language packs must be installed separately to support additional languages, for example:
sudo apt install tesseract-ocr-chi-sim
- For complex PDFs, it is recommended to enable --verbose and review the detailed logs for easier troubleshooting.
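Missing language packs are a common cause of failures, so it can be worth checking them before a run. The sketch below parses the output of `tesseract --list-langs` and reports which requested codes are not installed. The helper names are hypothetical, and the header line is assumed to start with "List of", as in recent Tesseract releases.

```python
import subprocess


def parse_list_langs(text):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def missing_languages(requested, available):
    """Return codes from a spec such as 'eng+chi_sim' that are not installed."""
    return [code for code in requested.split("+") if code not in available]


def installed_languages():
    """Ask the local Tesseract installation which languages it has."""
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=True)
    # Assumption: some versions print the list on stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

For example, `missing_languages("eng+chi_sim", installed_languages())` returns an empty list when both packs are present.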
Use Cases
- Digitization of documents
After scanning a paper document to PDF, use OCRmyPDF to add a text layer so the content can be searched and copied; this suits file management and legal document archiving.
- Academic research
Researchers can convert scanned academic papers into searchable PDFs, making it easy to extract citations or keywords and improving the efficiency of literature management.
- Corporate archiving
Businesses can batch process scanned contracts and invoices to generate PDF/A files, ensuring long-term retention and legal compliance.
- Multilingual document processing
For multilingual scanned documents, such as contracts mixing Chinese and English, OCRmyPDF recognizes multiple languages and embeds the text.
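Batch processing, as in the corporate archiving case, can be sketched as a planning step that pairs each scanned PDF with an output path and skips files already processed. The function name `plan_batch`, the directory layout, and the choice of `--output-type pdfa` are illustrative assumptions, not a prescribed workflow.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir):
    """Return one ocrmypdf argument list per PDF that still needs processing."""
    jobs = []
    for src in sorted(Path(input_dir).glob("*.pdf")):
        dst = Path(output_dir) / src.name
        # Skip files whose output already exists so reruns are cheap.
        if not dst.exists():
            jobs.append(["ocrmypdf", "--output-type", "pdfa", str(src), str(dst)])
    return jobs


# Example call (directory names are placeholders):
# import subprocess
# for cmd in plan_batch("scans", "archive"):
#     subprocess.run(cmd, check=True)
```

Separating planning from execution makes it possible to review the list of commands before any file is written.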
FAQ
- What operating systems does OCRmyPDF support?
It supports Linux, Windows, macOS, and FreeBSD, and it can also be run cross-platform via Docker.
- How do I handle non-English documents?
Use -l to specify the language code, e.g. -l chi_sim for Chinese; the corresponding language pack must be installed.
- What if the output file is larger than the input?
Use --optimize 1, or install the JBIG2 encoder, to reduce the file size.
- How do I verify OCR results?
Use --verbose 2 to generate detailed logs, or check that text in the output PDF can be selected and copied.