{"id":18866,"date":"2025-01-19T21:11:49","date_gmt":"2025-01-19T13:11:49","guid":{"rendered":"https:\/\/www.aisharenet.com\/?p=18866"},"modified":"2025-08-25T00:34:36","modified_gmt":"2025-08-24T16:34:36","slug":"zerox","status":"publish","type":"post","link":"https:\/\/www.kdjingpai.com\/en\/zerox\/","title":{"rendered":"Zerox: Convert PDF, DOCX, and Images to Markdown with High-Accuracy Vision-Model OCR"},"content":{"rendered":"<p>Zerox is an open-source project that uses vision models to convert PDF, DOCX, image, and other files to Markdown. Developed by the getomni-ai team, it offers a simple, efficient approach to OCR (optical character recognition). Zerox is available for both Node and Python; the Node package uses graphicsmagick and ghostscript for the PDF-to-image step. Given a file path and an OpenAI API key, it converts a document to Markdown, including documents with complex layouts such as tables and charts.<\/p>\n<p><img decoding=\"async\" title=\"Zerox: Convert PDF, DOCX, and Images to Markdown with Efficient Vision-Model OCR-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/01\/eb597bb6f2e3c55.png\" alt=\"Zerox: Convert PDF, DOCX, and Images to Markdown with Efficient Vision-Model OCR-1\" 
\/><\/p>\n<p>&nbsp;<\/p>\n<h2>Features<\/h2>\n<ul>\n<li>Converts PDF, DOCX, image, and other file formats<\/li>\n<li>Available for both Node and Python<\/li>\n<li>Uses vision models for efficient OCR<\/li>\n<li>Pulls in graphicsmagick and ghostscript automatically for PDF-to-image processing (install them manually if that fails)<\/li>\n<li>Accepts either a file path or a URL as input<\/li>\n<li>Offers optional parameters such as concurrency, page-orientation correction, and error-handling mode<\/li>\n<li>Supports pre-processing and post-processing callbacks<\/li>\n<li>Can save the conversion result to a specified directory<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h2>Usage Guide<\/h2>\n<h3>Installation<\/h3>\n<h4>Node<\/h4>\n<ol>\n<li>Install Node.js and npm.<\/li>\n<li>Run <code>npm install zerox<\/code>.<\/li>\n<li>Make sure graphicsmagick and ghostscript are installed. If they are missing, run the following commands (Debian\/Ubuntu):<\/li>\n<\/ol>\n<pre><code>sudo apt-get update\r\nsudo apt-get install -y graphicsmagick ghostscript\r\n<\/code><\/pre>\n<h4>Python<\/h4>\n<ol>\n<li>Install Python and pip.<\/li>\n<li>Run <code>pip install py-zerox<\/code> (the Python package is published as <code>py-zerox<\/code>).<\/li>\n<li>Make sure poppler is installed and on the PATH; the Python package relies on it for PDF-to-image conversion. On Debian\/Ubuntu, the following commands install it together with ghostscript:<\/li>\n<\/ol>\n<pre><code>sudo apt-get update\r\nsudo apt-get install -y poppler-utils 
ghostscript\r\n<\/code><\/pre>\n<h3>Usage<\/h3>\n<h4>Node<\/h4>\n<ol>\n<li>Import the zerox module:<\/li>\n<\/ol>\n<pre><code>import { zerox } from \"zerox\";\r\n<\/code><\/pre>\n<ol start=\"2\">\n<li>Convert from a file path:<\/li>\n<\/ol>\n<pre><code>const result = await zerox({\r\n  filePath: \"path\/to\/file.pdf\",\r\n  openaiAPIKey: process.env.OPENAI_API_KEY,\r\n});\r\n<\/code><\/pre>\n<ol start=\"3\">\n<li>Convert from a URL:<\/li>\n<\/ol>\n<pre><code>const result = await zerox({\r\n  filePath: \"https:\/\/example.com\/file.pdf\",\r\n  openaiAPIKey: process.env.OPENAI_API_KEY,\r\n});\r\n<\/code><\/pre>\n<h4>Python<\/h4>\n<ol>\n<li>Import the zerox function (the module is named <code>pyzerox<\/code>):<\/li>\n<\/ol>\n<pre><code>from pyzerox import zerox\r\n<\/code><\/pre>\n<ol start=\"2\">\n<li>Convert from a file path:<\/li>\n<\/ol>\n<pre><code># zerox is an async function: await it inside an async function or run it with asyncio.run().\r\n# The API key is read from the OPENAI_API_KEY environment variable.\r\nresult = await zerox(\r\n    file_path=\"path\/to\/file.pdf\",\r\n    model=\"gpt-4o-mini\",\r\n)\r\n<\/code><\/pre>\n<ol start=\"3\">\n<li>Convert from a URL:<\/li>\n<\/ol>\n<pre><code>result = await 
zerox(\r\n    file_path=\"https:\/\/example.com\/file.pdf\",\r\n    model=\"gpt-4o-mini\",  # API key is read from OPENAI_API_KEY\r\n)\r\n<\/code><\/pre>\n<h3>Main Features and Workflow<\/h3>\n<ol>\n<li><strong>File conversion<\/strong>: Pass a file path or URL to the zerox function; it returns the content as Markdown text.<\/li>\n<li><strong>Concurrency<\/strong>: Set the <code>concurrency<\/code> parameter to control how many pages are processed at once, which speeds up conversion.<\/li>\n<li><strong>Page-orientation correction<\/strong>: Enabled by default, so that converted text comes out in the correct orientation.<\/li>\n<li><strong>Error-handling mode<\/strong>: Set the <code>errorMode<\/code> parameter to either ignore errors or throw them.<\/li>\n<li><strong>Pre- and post-processing callbacks<\/strong>: Supply callback functions to run custom logic before and after each page is processed.<\/li>\n<li><strong>Saving results<\/strong>: Set the <code>outputDir<\/code> parameter to save the conversion result to a specified directory.<\/li>\n<\/ol>\n<h3>Example Code<\/h3>\n<h4>Node<\/h4>\n<pre><code>import { zerox } from \"zerox\";\r\nconst result = await zerox({\r\n  filePath: \"path\/to\/file.pdf\",\r\n  openaiAPIKey: process.env.OPENAI_API_KEY,\r\n  cleanup: true, \/\/ remove temporary images after the run\r\n  concurrency: 10, \/\/ number of pages processed at a time\r\n  correctOrientation: true, \/\/ detect and correct page orientation\r\n  errorMode: \"IGNORE\", \/\/ \"IGNORE\" or \"THROW\"\r\n  maintainFormat: false, \/\/ true passes the previous page as context (slower)\r\n  maxRetries: 1, \/\/ retries for a failed page\r\n  maxTesseractWorkers: -1, \/\/ maximum number of tesseract workers\r\n  model: \"gpt-4o-mini\",\r\n  onPostProcess: async ({ 
page, progressSummary }) =&gt; {}, \/\/ runs after each page is processed\r\n  onPreProcess: async ({ imagePath, pageNumber }) =&gt; {}, \/\/ runs before each page is processed\r\n  outputDir: \"output\", \/\/ directory where the result is saved\r\n  pagesToConvertAsImages: -1, \/\/ -1 converts all pages\r\n});\r\n<\/code><\/pre>\n<h4>Python<\/h4>\n<pre><code>from pyzerox import zerox\r\nimport asyncio\r\n\r\n# The API key is read from the OPENAI_API_KEY environment variable.\r\n# Node-only options (orientation correction, error mode, retries, callbacks)\r\n# are not part of the Python signature and are omitted here.\r\nasync def main():\r\n    return await zerox(\r\n        file_path=\"path\/to\/file.pdf\",\r\n        cleanup=True,  # remove temporary files after the run\r\n        concurrency=10,  # number of pages processed at a time\r\n        maintain_format=False,  # True passes the previous page as context (slower)\r\n        model=\"gpt-4o-mini\",\r\n        output_dir=\"output\",  # directory where the Markdown result is saved\r\n    )\r\n\r\nresult = asyncio.run(main())<\/code><\/pre>\n<p>&nbsp;<\/p>\n<p>We use a combination of <code>libreoffice<\/code> and <code>graphicsmagick<\/code> to convert documents to images. For non-image, non-PDF files, libreoffice first converts the file to PDF, which is then converted to images. The following file extensions are supported:<\/p>\n<pre>[\r\n\"pdf\", \/\/ Portable Document Format\r\n\"doc\", \/\/ Microsoft Word 97-2003\r\n\"docx\", \/\/ Microsoft Word 2007-2019\r\n\"odt\", \/\/ OpenDocument Text\r\n\"ott\", \/\/ OpenDocument Text Template\r\n\"rtf\", \/\/ Rich Text Format\r\n\"txt\", \/\/ Plain Text\r\n\"html\", \/\/ HTML Document\r\n\"htm\", \/\/ HTML Document (alternative extension)\r\n\"xml\", \/\/ XML Document\r\n\"wps\", \/\/ Microsoft Works Word Processor\r\n\"wpd\", \/\/ WordPerfect Document\r\n\"xls\", \/\/ Microsoft Excel 97-2003\r\n\"xlsx\", \/\/ Microsoft Excel 2007-2019\r\n\"ods\", \/\/ OpenDocument Spreadsheet\r\n\"ots\", \/\/ OpenDocument Spreadsheet Template\r\n\"csv\", \/\/ Comma-Separated Values\r\n\"tsv\", \/\/ Tab-Separated Values\r\n\"ppt\", \/\/ Microsoft PowerPoint 97-2003\r\n\"pptx\", \/\/ Microsoft 
PowerPoint 2007-2019\r\n\"odp\", \/\/ OpenDocument Presentation\r\n\"otp\", \/\/ OpenDocument Presentation Template\r\n];<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Zerox is an open-source project that uses vision models to convert PDF, DOCX, image, and other files to Markdown. Developed by the getomni-ai team, it offers a simple, efficient approach to OCR (optical character recognition). Zerox is available for both Node and Python and uses&#8230;<\/p>\n","protected":false},"author":1,"featured_media":32782,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20,499],"tags":[230,252],"class_list":["post-18866","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tool","category-document-extraction","tag-aikaiyuanxiangmu","tag-markdown"],"_links":{"self":[{"href":"https:\/\/www.kdjingpai.com\/en\/wp-json\/wp\/v2\/posts\/18866","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kdjingpai.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kdjingpai.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/en\/wp-json\/wp\/v2\/comments?post=18866"}],"version-history":[{"count":0,"href":"https:\/\/www.kdjingpai.com\/en\/wp-json\/wp\/v2\/posts\/18866\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/en\/wp-json\/wp\/v2\/media\/32782"}],"wp:attachment":[{"href":"https:\/\/www.kdjingpai.com\/en\/wp-json\/wp\/v2\/media?parent=18866"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/en\
/wp-json\/wp\/v2\/categories?post=18866"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/en\/wp-json\/wp\/v2\/tags?post=18866"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}