{"id":16983,"date":"2025-01-02T15:05:58","date_gmt":"2025-01-02T07:05:58","guid":{"rendered":"https:\/\/www.aisharenet.com\/?p=16983"},"modified":"2025-08-25T00:44:51","modified_gmt":"2025-08-24T16:44:51","slug":"extractthinker","status":"publish","type":"post","link":"https:\/\/www.kdjingpai.com\/pt\/extractthinker\/","title":{"rendered":"ExtractThinker: Extract and Classify Documents into Structured Data to Streamline Document Processing"},"content":{"rendered":"<p>ExtractThinker is a flexible document intelligence tool that uses large language models (LLMs) to extract and classify structured data from documents, offering a seamless, ORM-like document processing workflow. It supports multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI. Users define custom extraction contracts as Pydantic models for precise data extraction. The tool also supports asynchronous processing, handles multiple document formats (such as PDFs, images, and spreadsheets), and integrates with several LLM providers (such as OpenAI, Anthropic, and Cohere).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-16984\" title=\"ExtractThinker: Extract and Classify Documents into Structured Data to Streamline Document Processing-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/01\/804dafe7f73360a.jpg\" 
alt=\"ExtractThinker: Extract and Classify Documents into Structured Data to Streamline Document Processing-1\" width=\"1005\" height=\"472\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/01\/804dafe7f73360a.jpg 3082w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/01\/804dafe7f73360a-300x141.jpg 300w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/01\/804dafe7f73360a-1024x480.jpg 1024w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/01\/804dafe7f73360a-768x360.jpg 768w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/01\/804dafe7f73360a-1536x721.jpg 1536w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/01\/804dafe7f73360a-2048x961.jpg 2048w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/01\/804dafe7f73360a-18x8.jpg 18w\" sizes=\"auto, (max-width: 1005px) 100vw, 1005px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h2>Features<\/h2>\n<ul>\n<li><strong>Flexible document loaders<\/strong>: Supports multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.<\/li>\n<li><strong>Custom extraction contracts<\/strong>: Define custom extraction contracts as Pydantic models for precise data extraction.<\/li>\n<li><strong>Advanced classification<\/strong>: Classify documents, or sections of documents, using custom classifications and strategies.<\/li>\n<li><strong>Asynchronous processing<\/strong>: Process large documents efficiently with asynchronous processing.<\/li>\n<li><strong>Multi-format support<\/strong>: Seamlessly handles a range of document formats, such as 
PDFs, images, and spreadsheets.<\/li>\n<li><strong>ORM-style interaction<\/strong>: Work with documents and LLMs in an ORM-like style that simplifies development.<\/li>\n<li><strong>Splitting strategies<\/strong>: Use lazy or eager splitting strategies to process documents page by page or as a whole.<\/li>\n<li><strong>LLM integration<\/strong>: Integrates easily with different LLM providers, such as OpenAI, Anthropic, and Cohere.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h2>Usage<\/h2>\n<h3>Installation<\/h3>\n<ol>\n<li><strong>Install ExtractThinker<\/strong>: Install the package with pip:<\/li>\n<\/ol>\n<pre><code>pip install extract_thinker\r\n<\/code><\/pre>\n<h3>User guide<\/h3>\n<h4>Basic extraction example<\/h4>\n<p>The following example loads a document with PyPdf and extracts the specific fields defined in a contract:<\/p>\n<pre><code>import os\r\nfrom dotenv import load_dotenv\r\nfrom extract_thinker import Extractor, DocumentLoaderPyPdf, Contract\r\n\r\n# Load environment variables (for example, the LLM API key) from a .env file\r\nload_dotenv()\r\n\r\n# Contract listing the fields to extract\r\nclass InvoiceContract(Contract):\r\n    invoice_number: str\r\n    invoice_date: str\r\n\r\n# Path to the document to process\r\ntest_file_path = os.path.join(\"path_to_your_files\", \"invoice.pdf\")\r\n\r\n# Initialize the extractor\r\nextractor = Extractor()\r\nextractor.load_document_loader(DocumentLoaderPyPdf())\r\nextractor.load_llm(\"gpt-4o-mini\")  # or any other supported model\r\n\r\n# Extract data from the document\r\nresult = extractor.extract(test_file_path, InvoiceContract)\r\nprint(\"Invoice Number:\", result.invoice_number)\r\nprint(\"Invoice Date:\", 
result.invoice_date)\r\n<\/code><\/pre>\n<h4>Classification example<\/h4>\n<p>ExtractThinker can classify documents, or sections of documents, using custom classifications:<\/p>\n<pre><code>import os\r\nfrom dotenv import load_dotenv\r\nfrom extract_thinker import Extractor, Classification, Process, ClassificationStrategy\r\n\r\nload_dotenv()\r\n\r\nclass CustomClassification(Classification):\r\n    category: str\r\n\r\n# Path to the document to classify\r\ntest_file_path = os.path.join(\"path_to_your_files\", \"invoice.pdf\")\r\n\r\n# Initialize the extractor\r\nextractor = Extractor()\r\nextractor.load_classification_strategy(ClassificationStrategy.CUSTOM)\r\n\r\n# Define the classification\r\nclassification = CustomClassification(category=\"Invoice\")\r\n\r\n# Classify the document\r\nresult = extractor.classify(test_file_path, classification)\r\nprint(\"Category:\", result.category)\r\n<\/code><\/pre>\n<h3>Detailed workflow<\/h3>\n<ol>\n<li><strong>Load the document<\/strong>: Load the document with a supported document loader, such as PyPdf or Tesseract OCR.<\/li>\n<li><strong>Define an extraction contract<\/strong>: Define a custom extraction contract as a Pydantic model that specifies the fields to extract.<\/li>\n<li><strong>Initialize the extractor<\/strong>: Create an Extractor instance and load the document loader and the LLM.<\/li>\n<li><strong>Extract data<\/strong>: Call the <code>extract<\/code> 
method to extract data from the document; the result exposes the fields defined in the contract.<\/li>\n<li><strong>Classify the document<\/strong>: Classify the document, or sections of it, with a custom classification strategy by calling the <code>classify<\/code> method to obtain the result.<\/li>\n<\/ol>\n<p>With these steps, users can efficiently extract and classify data from documents in a variety of formats and streamline their document processing workflow.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>ExtractThinker is a flexible document intelligence tool that uses large language models (LLMs) to extract and classify structured data from documents, offering a seamless, ORM-like document processing workflow. It supports multiple document loaders, including Tesseract OCR, Azure Form 
Reco&#8230;<\/p>\n","protected":false},"author":1,"featured_media":32782,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20,499],"tags":[230,252],"class_list":["post-16983","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tool","category-document-extraction","tag-aikaiyuanxiangmu","tag-markdown"],"_links":{"self":[{"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/posts\/16983","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/comments?post=16983"}],"version-history":[{"count":0,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/posts\/16983\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/media\/32782"}],"wp:attachment":[{"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/media?parent=16983"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/categories?post=16983"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/tags?post=16983"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}