{"id":43400,"date":"2025-08-22T01:24:39","date_gmt":"2025-08-21T17:24:39","guid":{"rendered":"https:\/\/www.kdjingpai.com\/?p=43400"},"modified":"2025-08-22T01:24:39","modified_gmt":"2025-08-21T17:24:39","slug":"rag-xitongdejishi","status":"publish","type":"post","link":"https:\/\/www.kdjingpai.com\/de\/rag-xitongdejishi\/","title":{"rendered":"The Cornerstone of RAG Systems: A Deep Dive into Document Chunking Strategies"},"content":{"rendered":"<p>A familiar pattern: even when a <a href=\"https:\/\/www.kdjingpai.com\/de\/rag\/\">RAG<\/a> system uses the strongest available LLM and the prompt has been tuned repeatedly, answer quality is still disappointing. Answers either lack context or contain factual errors.<\/p>\n<p>Engineers check the retrieval algorithm and tune the <code>Embedding<\/code> model, but they often overlook a key step that happens before data enters the vector store: document chunking.<\/p>\n<p>Poor chunking amounts to feeding the model a pile of incomplete \"bad data\". However strong the model's reasoning, it cannot assemble a complete answer from fragmented knowledge. Chunking quality directly sets the lower bound on a RAG system's performance.<\/p>\n<p>This article skips abstract theory and focuses on working code and engineering experience for each chunking strategy, to give a RAG system a solid foundation.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" 
class=\"alignnone size-full wp-image-43396\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/08\/938b86b37563dad.jpeg\" alt=\"\" width=\"960\" height=\"409\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/08\/938b86b37563dad.jpeg 960w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/08\/938b86b37563dad-18x8.jpeg 18w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><\/p>\n<h3>Why Chunk?<\/h3>\n<p>The need for chunking comes from two core constraints:<\/p>\n<ul>\n<li><strong>Model context window<\/strong>: A large language model (LLM) cannot process text of unlimited length in one pass. Chunking splits a long document into pieces the model can handle.<\/li>\n<li><strong>Retrieval efficiency and noise<\/strong>: At retrieval time, a chunk that carries too much irrelevant information (noise) dilutes the core signal and makes it hard for the retriever to match the user's intent precisely.<\/li>\n<\/ul>\n<p>Ideal chunking balances <strong>contextual completeness<\/strong> against <strong>information density<\/strong>. <code>chunk_size<\/code> and <code>chunk_overlap<\/code> are the basic parameters for tuning that balance. <code>chunk_overlap<\/code> keeps some repeated text between adjacent chunks so that meaning stays continuous across chunk boundaries.<\/p>\n<h2>Basic Chunking Strategies<\/h2>\n<h3>Fixed-Length Chunking<\/h3>\n<p>This is the most direct method: cut the text at a preset number of characters. It ignores any logical structure in the text, so it is simple to implement but easily breaks semantic integrity.<\/p>\n<ul>\n<li><strong>Core idea<\/strong>: Split the text by a fixed character count, <code>chunk_size<\/code>.<\/li>\n<li><strong>Best for<\/strong>: Weakly structured plain text, or preprocessing stages with low semantic requirements.<\/li>\n<\/ul>\n<pre><code>from langchain_text_splitters import CharacterTextSplitter\r\n\r\nsample_text = (\r\n    \"LangChain was created by Harrison Chase in 2022. It provides a framework for developing applications \"\r\n    \"powered by language models. The library is known for its modularity and ease of use. \"\r\n    \"One of its key components is the TextSplitter class, which helps in document chunking.\"\r\n)\r\ntext_splitter = CharacterTextSplitter(\r\n    separator=\" \",      # Split on spaces\r\n    chunk_size=100,     # Maximum size of each chunk, in characters\r\n    chunk_overlap=20,   # Characters shared between adjacent chunks\r\n    length_function=len,\r\n)\r\ndocs = text_splitter.create_documents([sample_text])\r\nfor i, doc in enumerate(docs):\r\n    print(f\"--- Chunk {i+1} ---\")\r\n    print(doc.page_content)\r\n<\/code><\/pre>\n<h3>Recursive Character Chunking<\/h3>\n<p>This is the general-purpose strategy recommended by <code>LangChain<\/code>. It splits recursively using a preset list of separators (for example <code>[\"\\n\\n\", \"\\n\", \" \", \"\"]<\/code>), trying first to keep logical units such as paragraphs and sentences intact.<\/p>\n<ul>\n<li><strong>Core idea<\/strong>: Split recursively using a hierarchical list of separators.<\/li>\n<li><strong>Best for<\/strong>: The default general-purpose choice for the vast majority of text types.<\/li>\n<\/ul>\n<pre><code>from langchain_text_splitters import RecursiveCharacterTextSplitter\r\n\r\n# Using the same sample_text from the previous example\r\ntext_splitter = RecursiveCharacterTextSplitter(\r\n    chunk_size=100,\r\n    chunk_overlap=20,\r\n    # Default separators are [\"\\n\\n\", \"\\n\", \" \", \"\"]\r\n)\r\ndocs = text_splitter.create_documents([sample_text])\r\nfor i, doc in enumerate(docs):\r\n    print(f\"--- Chunk {i+1} ---\")\r\n    print(doc.page_content)\r\n<\/code><\/pre>\n<p><strong>Parameter tuning<\/strong>: For fixed-length and recursive chunking, the <code>chunk_size<\/code> and <code>chunk_overlap<\/code> settings are critical.<\/p>\n<ul>\n<li><strong><code>chunk_size<\/code><\/strong>: Sets the size of each chunk. Too small and a chunk lacks context; too large and it brings in noise and raises <code>API<\/code> call costs. The value is usually chosen from the input <code><a href=\"https:\/\/www.kdjingpai.com\/de\/tokenization\/\">token<\/a><\/code> limit of the <code>Embedding<\/code> model. Common values such as <code>256<\/code>, <code>512<\/code>, and <code>1024<\/code> trace back to the <code>512<\/code>-<code>token<\/code> context window of models such as <code>BERT<\/code>. Keep in mind that the splitters above count characters by default (<code>length_function=len<\/code>), so a character budget is not a token budget.<\/li>\n<li><strong><code>chunk_overlap<\/code><\/strong>: Sets the number of characters shared by adjacent chunks. A reasonable overlap (for example 10%-20% of <code>chunk_size<\/code>) helps avoid cutting a complete semantic unit at a chunk boundary and is key to semantic continuity.<\/li>\n<\/ul>\n<h3>Sentence-Based Chunking<\/h3>\n<p>Sentences are the smallest unit and are combined into chunks, which guarantees the most basic level of semantic integrity.<\/p>\n<ul>\n<li><strong>Core idea<\/strong>: Split the text into sentences, then aggregate sentences into chunks.<\/li>\n<li><strong>Best for<\/strong>: Scenarios that demand intact sentences, such as legal documents and news reports.<\/li>\n<\/ul>\n<pre><code>import nltk\r\n\r\n# nltk.data.find raises LookupError when the resource is missing\r\ntry:\r\n    nltk.data.find('tokenizers\/punkt')\r\nexcept LookupError:\r\n    nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead\r\nfrom nltk.tokenize import sent_tokenize\r\n\r\ndef chunk_by_sentences(text, max_chars=500, overlap_sentences=1):\r\n    sentences = sent_tokenize(text)\r\n    chunks = []\r\n    current_chunk = \"\"\r\n    for i, sentence in enumerate(sentences):\r\n        if len(current_chunk) + len(sentence) &lt;= max_chars:\r\n            current_chunk += \" \" + 
sentence\r\n        else:\r\n            if current_chunk.strip():  # skip the empty chunk when the first sentence is already too long\r\n                chunks.append(current_chunk.strip())\r\n            # Create overlap: restart from the previous sentence(s) plus the current one\r\n            start_index = max(0, i - overlap_sentences)\r\n            current_chunk = \" \".join(sentences[start_index:i+1])\r\n    if current_chunk:\r\n        chunks.append(current_chunk.strip())\r\n    return chunks\r\n\r\nlong_text = \"This is the first sentence. This is the second sentence, which is a bit longer. Now we have a third one. The fourth sentence follows. Finally, the fifth sentence concludes this paragraph.\"\r\nchunks = chunk_by_sentences(long_text, max_chars=100)\r\nfor i, chunk in enumerate(chunks):\r\n    print(f\"--- Chunk {i+1} ---\")\r\n    print(chunk)\r\n<\/code><\/pre>\n<p><strong>Note<\/strong>: For Chinese text, the default English model behind <code>nltk.tokenize.sent_tokenize<\/code> does not work. Use a splitting method suited to Chinese instead, for example a regular expression based on Chinese punctuation (\u3002\uff01\uff1f), or a library such as <code>spaCy<\/code> or <code>HanLP<\/code> loaded with a Chinese model.<\/p>\n<h2>Structure-Aware Chunking<\/h2>\n<p>These methods use the document's inherent structure (such as headings and lists) as chunk boundaries. The resulting chunks follow the document's logic and preserve context better.<\/p>\n<h3>Markdown Text Chunking<\/h3>\n<ul>\n<li><strong>Core idea<\/strong>: Define chunk boundaries by the <code>Markdown<\/code> heading hierarchy.<\/li>\n<li><strong>Best for<\/strong>: Well-formatted <code>Markdown<\/code> documents, such as <code>GitHub<\/code> <code>README<\/code> files and technical documentation.<\/li>\n<\/ul>\n<pre><code>from langchain_text_splitters import MarkdownHeaderTextSplitter\r\n\r\nmarkdown_document = \"\"\"\r\n# Chapter 1: The Beginning\r\n## Section 1.1: The Old World\r\nThis is the story of a time long past.\r\n## Section 1.2: A New Hope\r\nA new hero emerges.\r\n# Chapter 2: The Journey\r\n## Section 2.1: The Call to Adventure\r\nThe hero receives a mysterious call.\r\n\"\"\"\r\nheaders_to_split_on = [\r\n    (\"#\", \"Header 1\"),\r\n    (\"##\", \"Header 2\"),\r\n]\r\nmarkdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\r\nmd_header_splits = markdown_splitter.split_text(markdown_document)\r\nfor split in md_header_splits:\r\n    print(f\"Metadata: {split.metadata}\")\r\n    print(split.page_content)\r\n    print(\"-\" * 20)\r\n<\/code><\/pre>\n<h3>Dialogue-Based Chunking<\/h3>\n<ul>\n<li><strong>Core idea<\/strong>: Chunk by speaker or by conversational turn.<\/li>\n<li><strong>Best for<\/strong>: Customer-service chats, interview transcripts, and meeting minutes.<\/li>\n<\/ul>\n<pre><code>dialogue = [\r\n    \"Alice: Hi, I'm having trouble with my order.\",\r\n    \"Bot: I can help with that. What's your order number?\",\r\n    \"Alice: It's 12345.\",\r\n    \"Alice: I haven't received any shipping updates.\",\r\n    \"Bot: Let me check... It seems your order was shipped yesterday.\",\r\n    \"Alice: Oh, great! 
Thank you.\",\r\n]\r\n\r\ndef chunk_dialogue(dialogue_lines, max_turns_per_chunk=3):\r\n    chunks = []\r\n    # Group consecutive turns into fixed-size windows\r\n    for i in range(0, len(dialogue_lines), max_turns_per_chunk):\r\n        chunk = \"\\n\".join(dialogue_lines[i:i + max_turns_per_chunk])\r\n        chunks.append(chunk)\r\n    return chunks\r\n\r\nchunks = chunk_dialogue(dialogue)\r\nfor i, chunk in enumerate(chunks):\r\n    print(f\"--- Chunk {i+1} ---\")\r\n    print(chunk)\r\n<\/code><\/pre>\n<h2>Semantic and Topic Chunking<\/h2>\n<p>These methods go beyond the physical structure of the text and split by the meaning of the content.<\/p>\n<h3>Semantic Chunking<\/h3>\n<ul>\n<li><strong>Core idea<\/strong>: Compute the vector similarity of adjacent sentences or paragraphs and split where the meaning shifts abruptly (low similarity).<\/li>\n<li><strong>Best for<\/strong>: Knowledge bases, research papers, and other documents that need highly cohesive chunks.<\/li>\n<\/ul>\n<pre><code>import os\r\nfrom langchain_experimental.text_splitter import SemanticChunker\r\nfrom langchain_huggingface import HuggingFaceEmbeddings\r\n\r\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\r\nembeddings = HuggingFaceEmbeddings(model_name=\"sentence-transformers\/all-MiniLM-L6-v2\")\r\n# LangChain's SemanticChunker offers different threshold types:\r\n# \"percentile\": Threshold based on the percentile of similarity score differences. 
Good for adaptability.\r\n# \"standard_deviation\": Threshold based on standard deviation of similarity scores.\r\n# \"interquartile\": Uses the interquartile range, robust to outliers.\r\n# \"gradient\": Looks for sharp changes in similarity, useful for detecting abrupt topic shifts.\r\ntext_splitter = SemanticChunker(\r\n    embeddings,\r\n    breakpoint_threshold_type=\"percentile\",\r\n    breakpoint_threshold_amount=95,  # A higher percentile means it only breaks on very significant semantic shifts.\r\n)\r\nlong_text = (\r\n    \"The Wright brothers, Orville and Wilbur, were two American aviation pioneers \"\r\n    \"generally credited with inventing, building, and flying the world's first successful motor-operated airplane. \"\r\n    \"They made the first controlled, sustained flight of a powered, heavier-than-air aircraft on December 17, 1903. \"\r\n    \"In the following years, they continued to develop their aircraft. \"\r\n    \"Switching topics completely, let's talk about cooking. \"\r\n    \"A good pizza starts with a perfect dough, which needs yeast, flour, water, and salt. \"\r\n    \"The sauce is typically tomato-based, seasoned with herbs like oregano and basil. \"\r\n    \"Toppings can vary from simple mozzarella to a wide range of meats and vegetables. \"\r\n    \"Finally, let's consider the solar system. \"\r\n    \"It is a gravitationally bound system of the Sun and the objects that orbit it. \"\r\n    \"The largest objects are the eight planets, in order from the Sun: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.\"\r\n)\r\ndocs = text_splitter.create_documents([long_text])\r\nfor i, doc in enumerate(docs):\r\n    print(f\"--- Chunk {i+1} ---\")\r\n    print(doc.page_content)\r\n    print()\r\n<\/code><\/pre>\n<p><strong>Parameter tuning<\/strong>: <code>SemanticChunker<\/code> results depend heavily on <code>breakpoint_threshold_amount<\/code>. This threshold controls how sensitive the splitter is to semantic change. A low threshold produces many small, cohesive chunks, while a high threshold splits only when the topic shifts significantly. Expect to experiment repeatedly against your own documents.<\/p>\n<h3>Topic-Based Chunking<\/h3>\n<ul>\n<li><strong>Core idea<\/strong>: Use a topic model (such as <code>LDA<\/code>) or a clustering algorithm to split where the document's high-level topic changes.<\/li>\n<li><strong>Best for<\/strong>: Long, multi-topic reports or books.<\/li>\n<\/ul>\n<pre><code>import numpy as np\r\nimport re\r\nfrom sklearn.feature_extraction.text import CountVectorizer\r\nfrom sklearn.decomposition import LatentDirichletAllocation\r\nimport nltk\r\nfrom nltk.corpus import stopwords\r\n\r\n# Download the stopword list on first use\r\ntry:\r\n    stopwords.words('english')\r\nexcept LookupError:\r\n    nltk.download('stopwords')\r\n\r\ndef lda_topic_chunking(text: str, n_topics: int = 3) -&gt; list[str]:\r\n    # 1. 
Preprocessing: Treat each paragraph as a \"document\"\r\n    paragraphs = [p.strip() for p in text.split('\\n\\n') if p.strip()]\r\n    if len(paragraphs) &lt;= 1:\r\n        return [text]\r\n    cleaned_paragraphs = [re.sub(r'[^a-zA-Z\\s]', '', p).lower() for p in paragraphs]\r\n    # 2. Bag of Words + Stopword Removal\r\n    vectorizer = CountVectorizer(min_df=1, stop_words=stopwords.words('english'))\r\n    X = vectorizer.fit_transform(cleaned_paragraphs)\r\n    if X.shape[1] == 0:  # empty vocabulary: nothing to model\r\n        return paragraphs\r\n    # 3. LDA Topic Modeling\r\n    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)\r\n    lda.fit(X)\r\n    # 4. Determine dominant topic for each paragraph\r\n    topic_dist = lda.transform(X)\r\n    dominant_topics = np.argmax(topic_dist, axis=1)\r\n    # 5. Chunking based on topic changes\r\n    chunks = []\r\n    current_chunk_paragraphs = []\r\n    current_topic = dominant_topics[0]  # start from the first paragraph's topic\r\n    for i, paragraph in enumerate(paragraphs):\r\n        if dominant_topics[i] == current_topic:\r\n            current_chunk_paragraphs.append(paragraph)\r\n        else:\r\n            chunks.append(\"\\n\\n\".join(current_chunk_paragraphs))\r\n            current_chunk_paragraphs = [paragraph]\r\n            current_topic = dominant_topics[i]\r\n    chunks.append(\"\\n\\n\".join(current_chunk_paragraphs))\r\n    return chunks\r\n<\/code><\/pre>\n<p><strong>Note<\/strong>: Topic-based chunking is very sensitive to text length, topic separability, and preprocessing, and it requires the number of topics to be set in advance. It works better as an exploratory tool on long documents with clear topic boundaries.<\/p>\n<h2>Advanced Chunking Strategies<\/h2>\n<h3>Small-to-Big Chunking<\/h3>\n<ul>\n<li><strong>Core idea<\/strong>: Retrieve with small chunks (such as sentences) for high precision, then pass the original large chunk (such as the paragraph) that contains the hit to the <code>LLM<\/code> as context. This combines the retrieval precision of small chunks with the rich context of large ones.<\/li>\n<li><strong>Best for<\/strong>: Complex question answering that needs both precise retrieval and rich generation context.<\/li>\n<\/ul>\n<p>In <code>LangChain<\/code>, <code>ParentDocumentRetriever<\/code> implements this idea. Behind the scenes it maintains two stores, a vector index of child chunks and a <code>docstore<\/code> of parent chunks, and works as follows:<\/p>\n<ol>\n<li>Split the documents into large \"parent chunks\".<\/li>\n<li>Split each parent chunk further into small \"child chunks\".<\/li>\n<li>Build the vector index over the child chunks only.<\/li>\n<li>At query time, find the relevant child chunks first, then fetch their parent chunks from the separate <code>docstore<\/code> and return those to the <code>LLM<\/code>.<\/li>\n<\/ol>\n<pre><code># from langchain_openai import OpenAIEmbeddings\r\n# from langchain_text_splitters import RecursiveCharacterTextSplitter\r\n# from langchain.retrievers import ParentDocumentRetriever\r\n# from langchain_community.document_loaders import TextLoader\r\n# from langchain_chroma import Chroma\r\n# from langchain.storage import InMemoryStore\r\n# Assume 'docs' are loaded documents\r\n# parent_splitter = 
RecursiveCharacterTextSplitter(chunk_size=2000)\r\n# child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)\r\n# vectorstore = Chroma(embedding_function=OpenAIEmbeddings(), collection_name=\"split_parents\")\r\n# store = InMemoryStore()  # This store holds the parent documents\r\n# retriever = ParentDocumentRetriever(\r\n#     vectorstore=vectorstore,\r\n#     docstore=store,\r\n#     child_splitter=child_splitter,\r\n#     parent_splitter=parent_splitter,\r\n# )\r\n# retriever.add_documents(docs)\r\n# sub_docs = vectorstore.similarity_search(\"query\")  # Retrieves small chunks\r\n# retrieved_docs = retriever.invoke(\"query\")  # Retrieves large parent chunks\r\n# print(retrieved_docs[0].page_content)  # the retriever returns a list of documents\r\n<\/code><\/pre>\n<h3>Agentic Chunking<\/h3>\n<ul>\n<li><strong>Core idea<\/strong>: Use an <code>LLM<\/code> <code>Agent<\/code> to mimic how a human reads and understands text, deciding chunk boundaries dynamically. For example, prompt the <code>LLM<\/code> to decompose a passage into several \"self-contained knowledge chunks\".<\/li>\n<li><strong>Best for<\/strong>: Experimental projects, or highly complex unstructured text. Cost is very high and stability is still unproven.<\/li>\n<\/ul>\n<pre><code>import textwrap\r\n# from langchain_openai import ChatOpenAI\r\nfrom langchain_core.prompts import PromptTemplate\r\nfrom langchain_core.output_parsers import PydanticOutputParser\r\nfrom pydantic import BaseModel, Field\r\nfrom typing import List\r\n\r\nclass KnowledgeChunk(BaseModel):\r\n    chunk_title: str = Field(description=\"A concise title for this knowledge chunk.\")\r\n    chunk_text: str = Field(description=\"A self-contained text extracted and 
synthesized from the original paragraph.\")\r\n    representative_question: str = Field(description=\"A typical question that can be answered by this chunk.\")\r\n\r\nclass ChunkList(BaseModel):\r\n    chunks: List[KnowledgeChunk]\r\n\r\nparser = PydanticOutputParser(pydantic_object=ChunkList)\r\nprompt_template = \"\"\"\r\n[ROLE]: You are a top-tier document analyst. Your task is to decompose complex text into a set of core, self-contained \"Knowledge Chunks\".\r\n[TASK]: Read the provided text, identify the distinct core concepts, and create a knowledge chunk for each.\r\n[RULES]:\r\n1. Self-Contained: Each chunk must be understandable on its own.\r\n2. Single Concept: Each chunk should focus on only one core idea.\r\n3. Extract and Restructure: Pull all relevant sentences for a concept and combine them into a coherent paragraph.\r\n4. Follow Format: Strictly adhere to the JSON format instructions below.\r\n{format_instructions}\r\n[TEXT TO PROCESS]:\r\n{paragraph_text}\r\n\"\"\"\r\nprompt = PromptTemplate(\r\n    template=prompt_template,\r\n    input_variables=[\"paragraph_text\"],\r\n    partial_variables={\"format_instructions\": parser.get_format_instructions()},\r\n)\r\n# The following part is left commented out because it needs access to a running LLM.\r\n# model = ChatOpenAI(model=\"gpt-4\", temperature=0.0)\r\n# chain = prompt | model | parser\r\n# result = chain.invoke({\"paragraph_text\": 
document_text})\r\n# document_text is a placeholder for the passage you want to decompose\r\n<\/code><\/pre>\n<h2>Hybrid Chunking: Balancing Efficiency and Quality<\/h2>\n<p>In practice, no single strategy handles every case. Hybrid chunking is a very practical technique.<\/p>\n<ul>\n<li><strong>Core idea<\/strong>: First split coarsely with a macro-level strategy (such as structure-aware chunking), then apply a finer strategy (such as recursive chunking) to any chunk that is still too large.<\/li>\n<li><strong>Best for<\/strong>: Documents with complex structure and uneven content density.<\/li>\n<\/ul>\n<pre><code>from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter\r\nfrom langchain_core.documents import Document\r\n\r\nmarkdown_document = \"\"\"\r\n# Chapter 1: Company Profile\r\nOur company was founded in 2017...\r\n## 1.1 Development History\r\nThe company has experienced rapid growth...\r\n# Chapter 2: Core Technology\r\nThis chapter describes our core technologies in detail. Our framework is based on advanced distributed computing concepts... (A very long paragraph with multiple sentences describing different technical aspects like CNNs, Transformers, data pipelines, etc.)\r\n## 2.1 Technical Principles\r\nOur principles combine statistics, machine learning...\r\n# Chapter 3: Future Outlook\r\nLooking ahead, we will continue to invest in AI...\r\n\"\"\"\r\n\r\ndef hybrid_chunking(\r\n    markdown_document: str,\r\n    coarse_chunk_threshold: int = 400,\r\n    fine_chunk_size: int = 100,\r\n    fine_chunk_overlap: int = 20\r\n) -&gt; list[Document]:\r\n    # 1. 
Coarse-grained splitting by structure\r\n    headers_to_split_on = [(\"#\", \"Header 1\"), (\"##\", \"Header 2\")]\r\n    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\r\n    coarse_chunks = markdown_splitter.split_text(markdown_document)\r\n    # 2. Fine-grained recursive splitter for oversized chunks\r\n    fine_splitter = RecursiveCharacterTextSplitter(\r\n        chunk_size=fine_chunk_size,\r\n        chunk_overlap=fine_chunk_overlap\r\n    )\r\n    final_chunks = []\r\n    for chunk in coarse_chunks:\r\n        if len(chunk.page_content) &gt; coarse_chunk_threshold:\r\n            # If a chunk is too large, split it further (header metadata is carried over)\r\n            finer_chunks = fine_splitter.split_documents([chunk])\r\n            final_chunks.extend(finer_chunks)\r\n        else:\r\n            final_chunks.append(chunk)\r\n    return final_chunks\r\n\r\nfinal_chunks = hybrid_chunking(markdown_document)\r\nfor i, chunk in enumerate(final_chunks):\r\n    print(f\"--- Final Chunk {i+1} (Length: {len(chunk.page_content)}) ---\")\r\n    print(f\"Metadata: {chunk.metadata}\")\r\n    print(chunk.page_content)\r\n    print(\"-\" * 
80)\r\n<\/code><\/pre>\n<h2>How to Choose the Best Chunking Strategy?<\/h2>\n<p>With so many strategies available, a sensible selection path matters more than trying each one in turn. The layered decision framework below is a reasonable guide.<\/p>\n<p><strong>Step 1: Start from a baseline strategy<\/strong><\/p>\n<ul>\n<li><strong>Default option<\/strong>: <code>RecursiveCharacterTextSplitter<\/code>. Whatever the text, this is the safest starting point. Use it to establish a performance baseline.<\/li>\n<\/ul>\n<p><strong>Step 2: Check for structural features<\/strong><\/p>\n<ul>\n<li><strong>Preferred option<\/strong>: Structure-aware chunking. If the document has explicit structure (<code>Markdown<\/code> headings, <code>HTML<\/code> tags), switch to a method such as <code>MarkdownHeaderTextSplitter<\/code>. This is the cheapest optimization with the most visible payoff.<\/li>\n<\/ul>\n<p><strong>Step 3: When precision becomes the bottleneck<\/strong><\/p>\n<ul>\n<li><strong>Advanced option<\/strong>: Semantic chunking or small-to-big chunking. If retrieval quality is still poor with the basic and structure-aware strategies, the chunks need to capture higher-level semantic information.\n<ul>\n<li><code>SemanticChunker<\/code>: suited to cases where each chunk must be highly consistent in meaning.<\/li>\n<li><code>ParentDocumentRetriever<\/code> (small-to-big chunking): suited to complex question answering that needs precise retrieval and also complete context for the <code>LLM<\/code>.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><strong>Step 4: Handle extremely complex documents<\/strong><\/p>\n<ul>\n<li><strong>Advanced practice<\/strong>: Hybrid chunking. For documents with complex structure and uneven content density, hybrid chunking is a practical way to balance cost against quality.<\/li>\n<\/ul>\n<p>The table below summarizes all the chunking strategies discussed.<\/p>\n<table>\n<thead>\n<tr>\n<th align=\"left\">Chunking strategy<\/th>\n<th align=\"left\">Core logic<\/th>\n<th align=\"left\">Advantages<\/th>\n<th align=\"left\">Drawbacks and cost<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td align=\"left\"><strong>Fixed-length chunking<\/strong><\/td>\n<td align=\"left\">Split by a fixed number of characters or <code>token<\/code>s<\/td>\n<td align=\"left\">Simple to implement, fast<\/td>\n<td align=\"left\">Easily breaks semantics; weakest results<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Recursive chunking<\/strong><\/td>\n<td align=\"left\">Split recursively on preset separators (paragraphs, sentences)<\/td>\n<td align=\"left\">General-purpose; preserves structure fairly well<\/td>\n<td align=\"left\">Mediocre on documents with no regular structure<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Sentence-based chunking<\/strong><\/td>\n<td align=\"left\">Sentences are the smallest unit and are combined into chunks<\/td>\n<td 
align=\"left\">Keeps sentences intact<\/td>\n<td align=\"left\">A single sentence may lack context; long sentences need special handling<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Structure-aware chunking<\/strong><\/td>\n<td align=\"left\">Split on the document's inherent structure (such as headings)<\/td>\n<td align=\"left\">Follows the document's logic; clear context<\/td>\n<td align=\"left\">Depends heavily on document format; not general<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Semantic chunking<\/strong><\/td>\n<td align=\"left\">Split on changes in local semantic similarity<\/td>\n<td align=\"left\">Highly cohesive chunks; high retrieval precision<\/td>\n<td align=\"left\">High compute cost (<code>Embedding<\/code> computation); depends on model quality<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Topic-based chunking<\/strong><\/td>\n<td align=\"left\">Use a topic model to split at global topic boundaries<\/td>\n<td align=\"left\">Information within a chunk is closely related<\/td>\n<td align=\"left\">Complex to implement; sensitive to data and parameters; unstable results<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Hybrid chunking<\/strong><\/td>\n<td align=\"left\">Coarse macro-level split + fine micro-level split<\/td>\n<td align=\"left\">Balances efficiency and quality; practical<\/td>\n<td align=\"left\">More complex implementation logic<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Small-to-big chunking<\/strong><\/td>\n<td align=\"left\">Small chunks for retrieval, large chunks for generation<\/td>\n<td 
align=\"left\">Combines high-precision retrieval with rich context<\/td>\n<td align=\"left\">Complex pipeline; two indexes to manage; storage cost roughly doubles<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Agentic chunking<\/strong><\/td>\n<td align=\"left\">An <code>AI<\/code> agent dynamically analyzes and splits the document<\/td>\n<td align=\"left\">Best results in theory<\/td>\n<td align=\"left\">Experimental; very high cost (<code>API<\/code> calls); high latency<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n","protected":false},"excerpt":{"rendered":"<p>A common situation: even when a RAG system uses the strongest LLM and the prompt has been tuned repeatedly, question-answering quality is still unsatisfactory, with answers that either lack context or contain factual errors. 
Engineers check the retrieval algorithm and optimize the Embedding model, but they often overlook, before the data enters the vector store, the key&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[182],"tags":[],"class_list":["post-43400","post","type-post","status-publish","format-standard","hentry","category-shicao"],"_links":{"self":[{"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/posts\/43400","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/comments?post=43400"}],"version-history":[{"count":0,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/posts\/43400\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/media?parent=43400"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/categories?post=43400"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/de\/wp-json\/wp\/v2\/tags?post=43400"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}