It automatically analyzes the layout of PDF documents, identifies text, titles, images, tables, formulas and other elements on each page, and determines their correct reading order. The tool supports OCR, so scanned PDFs can be converted to searchable text. It runs in Docker and provides two models: a visual model (Vision Grid Transformer, or VGT) and a LightGBM model. The former is more accurate but resource-intensive; the latter is faster and lighter. The current version is v0.0.21. It is free and open source on GitHub and suits researchers, archivists, and others who need to work with PDFs.
Feature List
- Automatically recognizes text, titles, images, tables, formulas and other elements in PDF pages.
- Supports OCR to convert scanned PDFs to searchable text.
- Determines the correct reading order of page elements.
- Provides two analysis modes: the visual model (VGT) and the LightGBM model.
- Extracts tables and supports several output formats: Markdown, LaTeX, and HTML.
- Extracts formulas and outputs them in LaTeX format by default.
- Supports multi-language OCR, such as English and Korean.
- Provides an API for integration into other projects.
- Supports visual output by generating an annotated PDF.
Usage Guide
Installation
This tool runs in Docker. The installation steps are as follows:
- Prepare the environment
Install Docker first: download it from the Docker website and install it. After installation, type in the terminal:
docker --version
If a version number is displayed, the installation succeeded. To use a GPU, also install the NVIDIA Container Toolkit; refer to its installation guide.
- Pull the image
Enter the following command in the terminal to pull the tool image. The command is the same with or without a GPU:
docker pull huridocs/pdf-document-layout-analysis:v0.0.21
- Run the service
Start the service in one of two ways:
- With a GPU:
docker run --rm --name pdf-analysis --gpus '"device=0"' -p 5060:5060 huridocs/pdf-document-layout-analysis:v0.0.21
- Without a GPU:
docker run --rm --name pdf-analysis -p 5060:5060 huridocs/pdf-document-layout-analysis:v0.0.21
Once started, the service listens on port 5060 by default. If that port is already in use, map a different host port by changing the first number of the -p option, for example -p 5061:5060.
- Verify the service
Open your browser and visit http://localhost:5060/info
If version information is returned, the service is running normally.
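For scripted setups, the start and verification steps can be combined. The following Python sketch builds the same docker run arguments as the commands above and polls the /info endpoint until it answers. The helper names (docker_run_args, fetch_info, wait_until_ready) and the retry settings are illustrative assumptions, not part of the tool.

```python
import shlex
import time
import urllib.request

IMAGE = "huridocs/pdf-document-layout-analysis:v0.0.21"


def docker_run_args(host_port=5060, use_gpu=False, name="pdf-analysis"):
    """Build the argument list for the `docker run` commands shown above."""
    args = ["docker", "run", "--rm", "--name", name]
    if use_gpu:
        # The shell form '"device=0"' reaches Docker with the double quotes kept.
        args += ["--gpus", '"device=0"']
    # Host port on the left, the service's port 5060 inside the container on the right.
    args += ["-p", f"{host_port}:5060", IMAGE]
    return args


def fetch_info(url="http://localhost:5060/info", timeout=5):
    """Return the body of the /info endpoint as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def wait_until_ready(fetch=fetch_info, attempts=30, delay=2.0):
    """Call `fetch` until it succeeds; return its result, or None if it never does."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:  # connection refused, timeout, HTTP error, ...
            if attempt < attempts - 1:
                time.sleep(delay)
    return None


# Print the command for a CPU-only start on host port 5061.
print(shlex.join(docker_run_args(host_port=5061)))
```

To start the container from Python, the argument list can be passed to subprocess.Popen, followed by a call to wait_until_ready(). When a different host port is used, pass a matching fetcher, for example wait_until_ready(lambda: fetch_info("http://localhost:5061/info")).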
How to use the main features
The tool operates through an API with the following common functions:
1. OCR function
Use OCR to convert a scanned PDF into searchable text.
- Procedure:
Prepare a PDF such as test.pdf, then run in the terminal:
curl -X POST -F 'language=en' -F 'file=@/path/to/test.pdf' localhost:5060/ocr --output result.pdf
language=en selects English and can be replaced with kor (Korean) or another language code. The supported languages can be viewed with curl localhost:5060/info.
/path/to/test.pdf is the file path, e.g. /home/user/test.pdf.
The output file result.pdf is saved in the current directory.
- Result:
A searchable PDF whose text can be copied.
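When a whole folder of scans needs OCR, the curl call above can be repeated from a script. The Python sketch below builds the same request for every PDF in a folder. The function names, the _ocr.pdf naming scheme, and the dry_run switch are assumptions made for illustration; only the curl arguments come from the command above.

```python
import subprocess
from pathlib import Path


def ocr_command(pdf_path, output_path, language="en", host="localhost:5060"):
    """Return the curl argument list for one OCR request."""
    return [
        "curl", "-X", "POST",
        "-F", f"language={language}",
        "-F", f"file=@{pdf_path}",
        f"{host}/ocr",
        "--output", str(output_path),
    ]


def ocr_folder(folder, language="en", dry_run=True):
    """Build (and optionally run) one OCR request per PDF in `folder`."""
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if pdf.stem.endswith("_ocr"):
            continue  # skip results of an earlier run
        output = pdf.with_name(f"{pdf.stem}_ocr.pdf")
        command = ocr_command(pdf, output, language)
        commands.append(command)
        if not dry_run:
            # check=True raises CalledProcessError if curl exits with an error.
            subprocess.run(command, check=True)
    return commands
```

Calling ocr_folder("/home/user/scans", language="kor") returns the commands without running them; pass dry_run=False to send the requests to a running service.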
2. Layout analysis
To extract the elements of a PDF and analyze its layout:
- Procedure:
Run:
curl -X POST -F 'file=@/path/to/test.pdf' localhost:5060 --output analysis.json
The output file analysis.json contains information about each element, such as its location and type (text, table, etc.).
- Result:
The JSON file lists the details of each element.
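A quick way to inspect the result is to count the elements by type and by page. The sketch below assumes the JSON is a list of objects with type, page_number, and text keys; these field names and the sample values are assumptions, so check them against a real analysis.json first.

```python
import json
from collections import Counter


def summarize(segments):
    """Count elements per type and per page."""
    by_type = Counter(segment.get("type", "unknown") for segment in segments)
    by_page = Counter(segment.get("page_number") for segment in segments)
    return by_type, by_page


# Stand-in for the contents of analysis.json; the values are made up.
segments = json.loads("""
[
  {"type": "Title", "page_number": 1, "text": "A Study of Layouts"},
  {"type": "Text", "page_number": 1, "text": "First paragraph."},
  {"type": "Table", "page_number": 2, "text": "a b c"},
  {"type": "Text", "page_number": 2, "text": "Second paragraph."}
]
""")

by_type, by_page = summarize(segments)
for element_type, count in by_type.most_common():
    print(f"{element_type}: {count}")
```

For a real file, replace the sample with json.load on an opened analysis.json.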
3. Fast mode
For faster processing, use the LightGBM model by adding the parameter fast=true:
curl -X POST -F 'file=@/path/to/test.pdf' -F 'fast=true' localhost:5060 --output fast_analysis.json
- Note: This mode is fast, but slightly less accurate.
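To see where the two modes disagree on a given document, the element counts of both outputs can be compared. This is a rough sketch that assumes each output is a list of objects with a type key (an assumption, as is the sample data); it only compares counts and does not judge which model is right.

```python
from collections import Counter


def type_count_differences(vgt_segments, fast_segments):
    """Return {type: fast_count - vgt_count} for every type whose counts differ."""
    vgt = Counter(segment["type"] for segment in vgt_segments)
    fast = Counter(segment["type"] for segment in fast_segments)
    return {
        element_type: fast[element_type] - vgt[element_type]
        for element_type in sorted(set(vgt) | set(fast))
        if fast[element_type] != vgt[element_type]
    }


# Made-up stand-ins for analysis.json (VGT) and fast_analysis.json (LightGBM).
vgt_result = [{"type": "Text"}, {"type": "Text"}, {"type": "Table"}]
fast_result = [{"type": "Text"}, {"type": "Text"}, {"type": "Text"}]

print(type_count_differences(vgt_result, fast_result))
```

An empty result means both modes found the same number of elements of every type.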
4. Table and formula extraction
- Extract tables:
Specify the format (e.g. Markdown):
curl -X POST -F 'file=@/path/to/test.pdf' -F 'extraction_format=markdown' localhost:5060 --output table.json
The markdown, latex, and html formats are supported.
- Extract formulas:
Formulas are output in LaTeX format by default, so the layout analysis command returns them directly.
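The format name is easy to mistype, so a script can validate it before sending the request and then pick the tables or formulas out of the result. The sketch below makes several assumptions: the helper names are hypothetical, and the type values "Table" and "Formula" and the text key holding the extracted content are guesses about the JSON layout to verify against real output.

```python
SUPPORTED_FORMATS = ("markdown", "latex", "html")


def extraction_command(pdf_path, extraction_format, output="table.json",
                       host="localhost:5060"):
    """Return the curl argument list for the table extraction request above."""
    if extraction_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported format {extraction_format!r}; "
            f"expected one of {', '.join(SUPPORTED_FORMATS)}"
        )
    return [
        "curl", "-X", "POST",
        "-F", f"file=@{pdf_path}",
        "-F", f"extraction_format={extraction_format}",
        host,
        "--output", output,
    ]


def texts_of_type(segments, wanted_types):
    """Return the text of every element whose type is in `wanted_types`."""
    return [
        segment.get("text", "")
        for segment in segments
        if segment.get("type") in wanted_types
    ]


# Made-up sample standing in for the contents of table.json.
sample = [
    {"type": "Text", "text": "See Table 1."},
    {"type": "Table", "text": "| a | b |\n|---|---|\n| 1 | 2 |"},
    {"type": "Formula", "text": "E = mc^2"},
]
print(texts_of_type(sample, {"Table"})[0])
```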
5. Visualization output
To view an annotated PDF:
curl -X POST -F 'file=@/path/to/test.pdf' localhost:5060/visualize --output visualized.pdf
- Result:
The output PDF is annotated with the location and type of each element.
6. Adding language support
A few languages are supported by default. To add more (e.g. Chinese):
- Enter the container:
docker exec -it --user root pdf-analysis /bin/bash
- Install the language pack, e.g. Chinese:
apt-get install tesseract-ocr-chi-sim
- Check:
curl localhost:5060/info
If chi_sim appears in the output, the installation succeeded.
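The check can also be automated. Because the exact layout of the /info response is not described here, the sketch below treats it as plain text and looks for the language code as a whole token. The function name and the sample response are hypothetical.

```python
import re


def language_listed(info_text, code):
    """Return True if `code` appears as a whole token in the /info response text."""
    pattern = rf"\b{re.escape(code)}\b"
    return re.search(pattern, info_text) is not None


# Hypothetical response text; a real one comes from curl localhost:5060/info.
sample_info = '{"supported_languages": ["en", "kor", "chi_sim"]}'
print(language_listed(sample_info, "chi_sim"))
```

Matching on whole tokens avoids a false positive when only a longer code that contains the same letters (for example chi_sim_vert) is installed.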
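7. Stopping the service
To stop the service:
docker stop pdf-analysis
Because the container was started with --rm, stopping it also removes the container, including any language packs installed inside it.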
Output element order
The results of the analysis are organized in a specific order. The tool uses Poppler to determine the initial reading order, which is then adjusted according to the element type:
- Headers are placed first, sorted in their internal order.
- Ordinary elements (text, tables, etc.) are sorted by their average reading order.
- Footers and footnotes are placed last.
- Elements without text (e.g., images) take their position from the nearest element that has text.
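The four rules above can be sketched in a few lines of Python. This is an illustration of the rules for a single page, not the tool's actual implementation: the type names, the reading_order and top keys, the use of vertical distance to find the nearest element, and the choice to place a textless element right after its neighbor are all assumptions.

```python
HEADER_TYPES = {"Page header"}
TRAILING_TYPES = {"Page footer", "Footnote"}


def group_rank(segment):
    """0 for headers, 1 for ordinary elements, 2 for footers and footnotes."""
    if segment["type"] in HEADER_TYPES:
        return 0
    if segment["type"] in TRAILING_TYPES:
        return 2
    return 1


def order_segments(segments):
    """Order the segments of one page following the four rules above."""
    with_text = [s for s in segments if s["text"].strip()]
    without_text = [s for s in segments if not s["text"].strip()]

    # Rules 1-3: headers, then ordinary elements, then footers and footnotes,
    # each group sorted by its reading-order value.
    ordered = sorted(with_text, key=lambda s: (group_rank(s), s["reading_order"]))

    # Rule 4: put each textless element right after the nearest element with text.
    for segment in without_text:
        if not with_text:
            ordered.append(segment)
            continue
        nearest = min(with_text, key=lambda s: abs(s["top"] - segment["top"]))
        position = next(i for i, s in enumerate(ordered) if s is nearest)
        ordered.insert(position + 1, segment)
    return ordered


# Made-up page: reading_order is None for the picture because it has no text.
page = [
    {"type": "Page footer", "text": "3", "reading_order": 0, "top": 780},
    {"type": "Text", "text": "Intro", "reading_order": 2, "top": 100},
    {"type": "Page header", "text": "Journal", "reading_order": 5, "top": 20},
    {"type": "Picture", "text": "", "reading_order": None, "top": 320},
    {"type": "Text", "text": "Body", "reading_order": 3, "top": 300},
    {"type": "Footnote", "text": "1 note", "reading_order": 4, "top": 740},
]
print([segment["type"] for segment in order_segments(page)])
```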
Notes
- Hardware requirements: The visual model needs a GPU with 5 GB of video memory; without a GPU it runs on the CPU and is slow. LightGBM runs on the CPU only and needs 2 GB of RAM.
- Speed: On a 15-page academic paper, fast mode takes 0.42 s/page, VGT on a GPU takes 1.75 s/page, and VGT on a CPU takes 13.5 s/page.
- Debugging: View the logs when something goes wrong:
docker logs pdf-analysis
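The per-page figures above give a rough way to budget processing time. The sketch below multiplies them by a page count; it assumes time scales linearly with pages and that other documents behave like the 15-page academic paper that was measured, which may not hold. The mode names are labels chosen for this example.

```python
# Seconds per page, taken from the speed figures in the notes above.
SECONDS_PER_PAGE = {
    "fast": 0.42,     # LightGBM model
    "vgt_gpu": 1.75,  # visual model on a GPU
    "vgt_cpu": 13.5,  # visual model on a CPU
}


def estimate_seconds(pages, mode):
    """Estimate the processing time for `pages` pages, assuming linear scaling."""
    return pages * SECONDS_PER_PAGE[mode]


for mode in SECONDS_PER_PAGE:
    minutes = estimate_seconds(300, mode) / 60
    print(f"{mode}: about {minutes:.1f} min for 300 pages")
```

On these figures, VGT on a CPU is roughly 32 times slower than fast mode (13.5 / 0.42), which is the main reason to prefer fast mode on machines without a GPU.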
These features and steps will help you get started quickly and handle a variety of PDF needs.
Application scenarios
- Academic research
Researchers use it to extract tables and formulas from papers and organize data more efficiently.
- Document management
Archivists convert scans of old documents into searchable PDFs that are easy to find.
- Legal work
Attorneys analyze contract PDFs to quickly locate clauses and tables.
FAQ
- Is there a fee?
No. This is an open-source tool, free to download and use from GitHub.
- Is an internet connection required?
An internet connection is needed to download the image; after that the tool can run offline.
- Does it support Chinese?
Yes. The Chinese language pack must be installed manually (e.g. tesseract-ocr-chi-sim). Results are slightly worse than for English but usable.