{"id":20721,"date":"2025-02-10T10:17:59","date_gmt":"2025-02-10T02:17:59","guid":{"rendered":"https:\/\/www.aisharenet.com\/?p=20721"},"modified":"2025-08-25T00:27:19","modified_gmt":"2025-08-24T16:27:19","slug":"zchunk","status":"publish","type":"post","link":"https:\/\/www.kdjingpai.com\/de\/zchunk\/","title":{"rendered":"zChunk: A General-Purpose Semantic Chunking Strategy Based on Llama-70B"},"content":{"rendered":"<p>zChunk is a new chunking strategy developed by ZeroEntropy that aims to provide a solution for general-purpose semantic chunking. It is built on the Llama-70B model and produces chunks by prompting, which streamlines document chunking and keeps the signal-to-noise ratio high during information retrieval. zChunk is particularly suited to RAG (retrieval-augmented generation) applications that need high-precision retrieval, and it addresses the limitations of traditional chunking methods on complex documents. With zChunk, users can split documents into meaningful chunks more effectively, improving the accuracy and efficiency of information retrieval.<\/p>\n<blockquote><p>Your job is to act as a chunker.<\/p>\n<p>You should insert the &#8220;\u6bb5&#8221; throughout the input.<\/p>\n<p>Your goal is to separate the content into semantically relevant groupings.<\/p>\n<p>This approach has some common ground with the prompt described in <a href=\"https:\/\/www.kdjingpai.com\/llm-ocr-dejuxianxing\/\">The Limitations of LLM OCR: Document Parsing Challenges Beneath the Polished Surface<\/a>.<\/p><\/blockquote>\n<p><img loading=\"lazy\" decoding=\"async\" 
class=\"aligncenter size-full wp-image-20722\" title=\"zChunk: A General-Purpose Semantic Chunking Strategy Based on Llama-70B - 1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/02\/34147d1795e4268.jpg\" alt=\"zChunk: A General-Purpose Semantic Chunking Strategy Based on Llama-70B - 1\" width=\"1946\" height=\"1304\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/02\/34147d1795e4268.jpg 1946w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/02\/34147d1795e4268-768x515.jpg 768w\" sizes=\"auto, (max-width: 1946px) 100vw, 1946px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h2>Features<\/h2>\n<ul>\n<li><strong>Llama-70B-based chunking algorithm<\/strong>: prompts the Llama-70B model to perform semantic chunking.<\/li>\n<li><strong>High signal-to-noise chunking<\/strong>: tunes the chunking strategy so that retrieved information has a high signal-to-noise ratio.<\/li>\n<li><strong>Multiple chunking strategies<\/strong>: supports fixed-size chunking, embedding-similarity-based chunking, and other strategies.<\/li>\n<li><strong>Hyperparameter tuning<\/strong>: provides a hyperparameter tuning pipeline so users can adjust the chunk size and overlap parameters to their needs.<\/li>\n<li><strong>Open source<\/strong>: the complete source code is available, and users are free to use and modify it.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h2>Usage Guide<\/h2>\n<h3>Installation<\/h3>\n<ol>\n<li><strong>Clone the repository<\/strong>:<\/li>\n<\/ol>\n<pre><code>git clone https:\/\/github.com\/zeroentropy-ai\/zchunk.git\r\ncd 
zchunk\r\n<\/code><\/pre>\n<ol start=\"2\">\n<li><strong>Install dependencies<\/strong>:<\/li>\n<\/ol>\n<pre><code>pip install -r requirements.txt\r\n<\/code><\/pre>\n<h3>Usage<\/h3>\n<ol>\n<li><strong>Prepare the input file<\/strong>: save the document to be chunked as a text file, for example <code>example_input.txt<\/code>.<\/li>\n<li><strong>Run the chunking script<\/strong>:<\/li>\n<\/ol>\n<pre><code>python test.py --input example_input.txt --output example_output.txt\r\n<\/code><\/pre>\n<ol start=\"3\">\n<li><strong>Check the output file<\/strong>: the chunking result is saved to <code>example_output.txt<\/code>.<\/li>\n<\/ol>\n<h3>Detailed Workflow<\/h3>\n<ol>\n<li><strong>Choose a chunking strategy<\/strong>:\n<ul>\n<li><strong>NaiveChunk<\/strong>: fixed-size chunking, suitable for simple documents.<\/li>\n<li><strong>SemanticChunk<\/strong>: chunking based on embedding similarity, suitable for documents whose semantic integrity must be preserved.<\/li>\n<li><strong>zChunk 
Algorithm<\/strong>: chunking produced by prompting the Llama-70B model, suitable for complex documents.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Adjust the hyperparameters<\/strong>:\n<ul>\n<li><strong>Chunk size<\/strong>: set the size of each chunk with the <code>chunk_size<\/code> parameter.<\/li>\n<li><strong>Overlap ratio<\/strong>: set the overlap between adjacent chunks with the <code>overlap_ratio<\/code> parameter to preserve continuity of information.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Run hyperparameter tuning<\/strong>:<\/li>\n<\/ol>\n<pre><code>python hyperparameter_tuning.py --input example_input.txt --output tuned_output.txt\r\n<\/code><\/pre>\n<p>This script automatically adjusts the chunk size and overlap ratio for the input document and produces the optimal chunking result.<\/p>\n<ol start=\"4\">\n<li><strong>Evaluate the chunking<\/strong>:\n<ul>\n<li>Run the provided evaluation script on the chunking result to confirm that the chunking strategy is effective.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<pre><code>python evaluate.py --input example_input.txt --output example_output.txt\r\n<\/code><\/pre>\n<h3>Example<\/h3>\n<p>Suppose we have a passage from the United States Constitution that needs to be chunked:<\/p>\n<p>Original text:<\/p>\n<pre><code>Section. 1.\r\nAll legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.\r\nSection. 
2.\r\nThe House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have the Qualifications requisite for Electors of the most numerous Branch of the State Legislature.\r\nNo Person shall be a Representative who shall not have attained to the Age of twenty five Years, and been seven Years a Citizen of the United States, and who shall not, when elected, be an Inhabitant of that State in which he shall be chosen.\r\n<\/code><\/pre>\n<p>Chunking with the zChunk algorithm:<\/p>\n<ol>\n<li><strong>Choose a marker token<\/strong>: pick a special token that does not appear in the corpus (for example, \u201c\u6bb5\u201d).<\/li>\n<li><strong>Insert the marker<\/strong>: have Llama insert this token throughout the user message.<\/li>\n<\/ol>\n<pre><code>SYSTEM_PROMPT (simplified):\r\nYour job is to act as a chunker.\r\nYou should insert the \u201c\u6bb5\u201d token throughout the input.\r\nYour goal is to separate the content into semantically relevant groupings.\r\n<\/code><\/pre>\n<ol start=\"3\">\n<li><strong>Generate the chunks<\/strong>:<\/li>\n<\/ol>\n<pre><code>Section. 1.\r\nAll legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.\u6bb5\r\nSection. 
2.\r\nThe House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have the Qualifications requisite for Electors of the most numerous Branch of the State Legislature.\u6bb5\r\nNo Person shall be a Representative who shall not have attained to the Age of twenty five Years, and been seven Years a Citizen of the United States, and who shall not, when elected, be an Inhabitant of that State in which he shall be chosen.\u6bb5\r\n<\/code><\/pre>\n<p>In this way the document is split into semantically related chunks, each of which can be retrieved independently, which improves the signal-to-noise ratio and accuracy of information retrieval.<\/p>\n<h3>Optimization<\/h3>\n<ul>\n<li>Running Llama inference locally makes it possible to process whole passages efficiently and to inspect the logprobs to decide where to split.<\/li>\n<li>Processing 450,000 characters takes about 15 minutes, though optimizing the code could reduce this significantly.<\/li>\n<\/ul>\n<h3>Benchmarks<\/h3>\n<ul>\n<li>On the LegalBenchConsumerContractsQA dataset, zChunk scores higher on retrieval ratio and signal ratio than NaiveChunk and semantic chunking.<\/li>\n<\/ul>\n<p>With the zChunk algorithm, documents of any type can be split easily without relying on regular expressions or hand-written rules, improving the efficiency and accuracy of RAG applications.<\/p>\n","protected":false},
"excerpt":{"rendered":"<p>zChunk is a new chunking strategy developed by ZeroEntropy that aims to provide a solution for general-purpose semantic chunking. It is built on the Llama-70B model and produces chunks by prompting, which streamlines document chunking and keeps the signal-to-noise ratio high during information retrieval. zChunk is particularly suited to high-precision retrieval in RA&#8230;<\/p>\n","protected":false},"author":1,"featured_media":32782,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20,499],"tags":[230,252],"class_list":["post-20721","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tool","category-document-extraction","tag-aikaiyuanxiangmu","tag-markdown"],"_links":{"self":[{"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/posts\/20721","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/comments?post=20721"}],"version-history":[{"count":0,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/posts\/20721\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/media\/32782"}],"wp:attachment":[{"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/media?parent=20721"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/categories?post=20721"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/tags?post=20721"}],"curies":[
{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}