Using OCR to make PDF text editable
Scanned PDFs cannot be searched or copied from. To address this pain point, you can use the OCR function of an open source tool to convert the scanned pages into text. The operation is divided into three steps:
- Environment preparation: after installing Docker, pull the dedicated image
huridocs/pdf-document-layout-analysis:v0.0.21
GPU and non-GPU images are available separately.
- Service startup: start the service with the docker run command. Note that GPU devices need the additional --gpus parameter.
- File conversion: send a request using the curl command
curl -X POST -F 'language=en' -F 'file=@/path/to/test.pdf' localhost:5060/ocr --output result.pdf
The language parameter can be replaced with the desired language (e.g. kor for Korean).
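The three steps above can be sketched as a small set of shell functions. This is a minimal sketch, not the project's official start-up procedure: the function names, the `DRY_RUN` switch, and the `-p 5060:5060` port mapping are assumptions (the port is inferred from the curl example), and it assumes the image's default entrypoint starts the service. Check the project's README for the exact docker run line.

```shell
#!/usr/bin/env bash
# Sketch of the three steps: pull the image, start the service, convert one PDF.
# With DRY_RUN=1 each command is printed instead of executed.

IMAGE="huridocs/pdf-document-layout-analysis:v0.0.21"  # image tag from the text
PORT=5060                                              # port taken from the curl example

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: pull the image.
pull_image() {
  run docker pull "$IMAGE"
}

# Step 2: start the service. Pass "gpu" as the first argument to add --gpus all.
# Assumption: publishing container port 5060 on the same host port is sufficient.
start_service() {
  local args=(--rm -p "${PORT}:${PORT}")
  if [ "${1:-cpu}" = "gpu" ]; then
    args+=(--gpus all)
  fi
  run docker run "${args[@]}" "$IMAGE"
}

# Step 3: send one PDF to the /ocr endpoint and save the converted file.
# Arguments: input PDF, output PDF, language (defaults to en).
ocr_pdf() {
  local input="$1" output="$2" language="${3:-en}"
  run curl -X POST -F "language=${language}" -F "file=@${input}" \
    "localhost:${PORT}/ocr" --output "$output"
}
```

A typical session would call `pull_image`, then `start_service` (or `start_service gpu`) in one terminal, and `ocr_pdf /path/to/test.pdf result.pdf kor` in another. Setting `DRY_RUN=1` first lets you inspect the commands before running them.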
Advanced Tips:
- Chinese support requires manually installing the language pack: enter the container and run
apt-get install tesseract-ocr-chi-sim
- When dealing with large numbers of files, write a shell script that calls the API in a loop.
- VGT visual models are recommended for documents with high quality requirements (GPU support required)
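The first two tips can be sketched in shell as follows. This is a rough sketch under stated assumptions: the function names, the `DRY_RUN` switch, and the container name passed to `install_chinese_pack` are hypothetical, and the port is taken from the curl example. Depending on the image, `apt-get update` may need to run first and the command may need root inside the container.

```shell
#!/usr/bin/env bash
# Sketch: install the Chinese language pack in the running container,
# then OCR every PDF in a directory. With DRY_RUN=1 commands are only printed.

PORT=5060  # port taken from the curl example in the text

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Install the simplified-Chinese Tesseract pack inside a running container.
# The argument is a hypothetical container name or ID (look it up with `docker ps`).
install_chinese_pack() {
  local container="$1"
  run docker exec "$container" apt-get install -y tesseract-ocr-chi-sim
}

# OCR every *.pdf in in_dir and write each result under the same name in out_dir.
# Arguments: input directory, output directory, language (defaults to en).
batch_ocr() {
  local in_dir="$1" out_dir="$2" language="${3:-en}"
  local pdf name
  run mkdir -p "$out_dir"
  for pdf in "$in_dir"/*.pdf; do
    [ -e "$pdf" ] || continue   # skip when the glob matched nothing
    name="$(basename "$pdf")"
    # Report a failed request on stderr and continue with the next file.
    run curl -sS -X POST -F "language=${language}" -F "file=@${pdf}" \
      "localhost:${PORT}/ocr" --output "${out_dir}/${name}" \
      || echo "failed: ${pdf}" >&2
  done
}
```

For example, `batch_ocr ./scans ./converted kor` would send each PDF in `./scans` to the service one at a time. Processing files sequentially keeps the load on the OCR service predictable, at the cost of speed.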
This answer comes from the article "Automatically parse PDF content and extract text and tables of open source services".