When processing multilingual PDF documents, use the -l
parameter to specify the combination of language codes:
- Basic command format:
ocrmypdf -l langcode1+langcode2 input.pdf output.pdf
- For example, handling mixed Chinese and English documents:
ocrmypdf -l eng+chi_sim input.pdf output.pdf
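When the list of languages varies from document to document, the value for -l can be assembled rather than typed by hand. The following is a minimal sketch; `join_langs` is a hypothetical helper name, not part of OCRmyPDF, and the command is only printed so the sketch runs without any input file.

```shell
#!/bin/sh
# join_langs (hypothetical helper): join language codes with "+",
# producing the value expected by the -l parameter.
# The body runs in a subshell so the IFS change does not leak out.
join_langs() (
    IFS=+
    printf '%s\n' "$*"
)

langs=$(join_langs eng chi_sim)

# Print the resulting command instead of running it.
echo "ocrmypdf -l $langs input.pdf output.pdf"
```

Replacing `echo` with a direct call runs the conversion, provided input.pdf exists and the language packs are installed.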
Caveats:
- The corresponding Tesseract language packs must be installed in advance; for example, Simplified Chinese requires the
tesseract-ocr-chi-sim package.
- Language codes are listed in the Tesseract documentation.
- Using --verbose 2 is recommended for checking the recognition results.
- For documents with complex layouts, it may be necessary to adjust parameters or use plugins.
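Before running OCR, it can help to confirm that every requested language pack is actually installed. The sketch below compares a "+"-joined language spec against the output of `tesseract --list-langs`. The function name `missing_langs` is hypothetical, and the `2>&1` redirect is a precaution based on the assumption that some Tesseract versions print the list on standard error.

```shell
#!/bin/sh
# missing_langs (hypothetical helper): read the installed language codes
# from standard input, one per line, and print every code from the
# "+"-joined spec in $1 that is not among them.
missing_langs() {
    spec=$1
    installed=$(cat)
    for code in $(printf '%s' "$spec" | tr '+' ' '); do
        # -x matches whole lines only, -F treats the code as a literal string,
        # so the header line printed by tesseract never matches a code.
        printf '%s\n' "$installed" | grep -qxF "$code" || printf '%s\n' "$code"
    done
}

# Run the check only when tesseract is available on PATH.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs 2>&1 | missing_langs eng+chi_sim
fi
```

Empty output means all requested languages are available; any code that is printed still needs its language pack installed.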
This answer comes from the article "OCRmyPDF: scanned PDF into searchable text of the open source tool".