{"id":31788,"date":"2025-06-27T19:29:13","date_gmt":"2025-06-27T11:29:13","guid":{"rendered":"https:\/\/www.kdjingpai.com\/?p=31788"},"modified":"2025-06-27T19:29:13","modified_gmt":"2025-06-27T11:29:13","slug":"congrumendaojingtong","status":"publish","type":"post","link":"https:\/\/www.kdjingpai.com\/ja\/congrumendaojingtong\/","title":{"rendered":"From Beginner to Expert: An In-Depth Look at Reinforcement Learning and GRPO Model Training"},"content":{"rendered":"<p>Learn all about reinforcement learning (RL) and how to train your own <a href=\"https:\/\/www.kdjingpai.com\/en\/deepseek-r1nenglixiang\/\">DeepSeek-R1<\/a> reasoning model with <a href=\"https:\/\/www.kdjingpai.com\/en\/unsloth\/\">Unsloth<\/a> and <a href=\"https:\/\/www.kdjingpai.com\/en\/grpo-ruhezaishi\/\">GRPO<\/a>. A complete guide from beginner to advanced.<\/p>\n<h2>\ud83e\udda5 What you will learn<\/h2>\n<ol>\n<li>What are RL, RLVR, PPO, GRPO, RLHF, and RFT? Is <strong>\u201cluck is all you need\u201d<\/strong> really true of reinforcement learning?<\/li>\n<li>What are an environment, an agent, an action, a reward function, and a reward?<\/li>\n<\/ol>\n<p>This article covers everything you need to know about GRPO, reinforcement learning (RL), and reward functions, from beginner to advanced, along with tips and the basics of using Unsloth for GRPO. If you are looking for a step-by-step tutorial on using GRPO, see our separate guide.<\/p>\n<h2>\u2753 
What is reinforcement learning (RL)?<\/h2>\n<p>The goal of RL is to:<\/p>\n<ol>\n<li><strong>Increase the chance of seeing \u201cgood\u201d outcomes.<\/strong><\/li>\n<li><strong>Decrease the chance of seeing \u201cbad\u201d outcomes.<\/strong><\/li>\n<\/ol>\n<p><strong>That's it!<\/strong> The subtleties lie in what \u201cgood\u201d and \u201cbad\u201d mean, how we go about \u201cincreasing\u201d or \u201cdecreasing\u201d those chances, and even what an \u201coutcome\u201d is.<\/p>\n<p>For example, in the <strong>Pacman game<\/strong>:<\/p>\n<ol>\n<li>The <strong>environment<\/strong> is the game world.<\/li>\n<li>The <strong>actions<\/strong> you can take are up, left, right, and down.<\/li>\n<li>The <strong>rewards<\/strong> are good if you eat a pellet and bad if you run into one of the squiggly enemies.<\/li>\n<li>In RL you cannot know the \u201cbest action\u201d to take, but you can observe intermediate steps or the final game state (win or lose).<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31789\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/2f9094bc240db29.jpg\" 
alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"709\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/2f9094bc240db29.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/2f9094bc240db29-18x12.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31790\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/ecbbc04c47607c2.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"674\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/ecbbc04c47607c2.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/ecbbc04c47607c2-18x10.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<p>As another example, imagine you are asked: <strong>\u201cWhat is 2 + 2?\u201d<\/strong> (The answer is 4.) An unaligned language model might spit out 3, 4, C, D, -10, or literally anything.<\/p>\n<ol>\n<li>A number is better than C or D, right?<\/li>\n<li>Getting 3 is better than getting 8, right?<\/li>\n<li>Getting 4 is definitely correct.<\/li>\n<\/ol>\n<p>We just designed a <strong>reward function<\/strong>!<\/p>\n<h2>\ud83c\udfc3 From RLHF and PPO to GRPO and RLVR<\/h2>\n<p><img loading=\"lazy\" 
decoding=\"async\" class=\"aligncenter size-full wp-image-31792\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/d40236b8edb2fdb.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"674\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/d40236b8edb2fdb.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/d40236b8edb2fdb-18x10.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<p>OpenAI popularized the concept of RLHF (reinforcement learning from human feedback), in which we train an <strong>\u201cagent\u201d<\/strong> to produce outputs for a question (the <strong>state<\/strong>) that humans rate as more useful.<\/p>\n<p>For example, the thumbs-up and thumbs-down buttons in ChatGPT can feed into the RLHF process.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31791\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/d7ce89f10b8f64d.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"682\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/d7ce89f10b8f64d.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/d7ce89f10b8f64d-18x10.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<p><img 
loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31793\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/95d9ce9f0314cc5.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"172\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/95d9ce9f0314cc5.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/95d9ce9f0314cc5-18x3.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><br \/>\n<em>The PPO formula<\/em><\/p>\n<p>The <code>clip(..., 1-e, 1+e)<\/code> term, where <code>e<\/code> stands for the clipping range epsilon, forces PPO not to make overly large changes. There is also a KL term, with beta set to &gt; 0, that forces the model not to drift too far from the reference.<\/p>\n<p>To carry out RLHF, <strong>PPO<\/strong> (proximal policy optimization) was adopted. In this setting the <strong>agent<\/strong> is the language model. In fact, it is made up of 3 systems:<\/p>\n<ol>\n<li><strong>The generating policy (the model currently being trained)<\/strong><\/li>\n<li><strong>The reference policy (the original model)<\/strong><\/li>\n<li><strong>The value model (an average-reward estimator)<\/strong><\/li>\n<\/ol>\n<p>We use a <strong>reward model<\/strong> to compute the reward for the current environment, and our goal is to <strong>maximize this reward<\/strong>!<\/p>\n<p>PPO 
uses a formula that looks rather complicated because it was designed to be stable. See our AI Engineer talk on RL, given in 2025, for a more in-depth mathematical derivation of PPO.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31794\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/f253022185f212a.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"682\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/f253022185f212a.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/f253022185f212a-18x10.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<p><a href=\"https:\/\/www.kdjingpai.com\/en\/deepseek-chatshena\/\">DeepSeek<\/a> developed <strong>GRPO<\/strong> (group relative policy optimization) to train its R1 reasoning models. The key differences from PPO are:<\/p>\n<ol>\n<li><strong>The value model is removed<\/strong> and replaced with statistics computed from calling the reward model multiple times.<\/li>\n<li><strong>The reward model is removed<\/strong> and replaced with custom reward functions, for which <strong>RLVR<\/strong> can be used.<\/li>\n<\/ol>\n<p>This means GRPO is extremely efficient. Previously, PPO 
had to train multiple models; now that the reward model and the value model are removed, we can save memory and speed everything up.<\/p>\n<p><strong>RLVR (reinforcement learning with verifiable rewards)<\/strong> lets us reward the model on tasks whose solutions are easy to verify. For example:<\/p>\n<ol>\n<li>Math equations can be verified easily, for example 2+2 = 4.<\/li>\n<li>Code output can be checked for whether it executed correctly.<\/li>\n<li>Designing verifiable reward functions can be hard, which is why most examples are math or code.<\/li>\n<li>GRPO's use cases are not limited to code or math: its reasoning process can improve tasks such as email automation, database retrieval, law, and medicine, greatly improving accuracy depending on your dataset and reward function. The trick is to define a <strong>rubric<\/strong>, that is, a list of smaller verifiable rewards rather than one final, all-encompassing reward. OpenAI, for example, popularized this in its reinforcement learning fine-tuning (RFT) service.<\/li>\n<\/ol>\n<p><strong>Why \u201cGroup Relative\u201d?<\/strong><\/p>\n<p>GRPO 
removes the value model entirely, but we still need to estimate the <strong>\u201caverage reward\u201d<\/strong> given the current state.<\/p>\n<p><strong>The trick is to sample the LLM<\/strong>! We then compute the average reward from statistics of the sampling process across multiple different questions.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31795\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/9e21965b4c6175c.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"674\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/9e21965b4c6175c.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/9e21965b4c6175c-18x10.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<p>For example, for the question \u201cWhat is 2+2?\u201d we sample 4 times. We might get 4, 3, D, C. We then compute the reward for each of these answers, compute the <strong>average reward<\/strong> and the <strong>standard deviation<\/strong>, and then <strong>Z-score standardize<\/strong> the rewards, that is, subtract the average from each reward and divide by the standard deviation!<\/p>\n<p>This creates the <strong>advantages A<\/strong>, which we use in place of the value model. This saves a lot of memory!<\/p>\n<p><img 
loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31796\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/9ea786e1f22341d.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"674\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/9ea786e1f22341d.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/9ea786e1f22341d-18x10.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><br \/>\n<em>GRPO advantage calculation<\/em><\/p>\n<h2>\ud83e\udd1e Luck (or rather, patience) is all you need<\/h2>\n<p>The trick of RL is that you need only two things:<\/p>\n<ol>\n<li>A question or instruction, for example \u201cWhat is 2+2?\u201d or \u201cCreate a Flappy Bird game in Python\u201d.<\/li>\n<li>A reward function and a verifier to check whether the output is good or bad.<\/li>\n<\/ol>\n<p>With just these two, we can essentially <strong>call a language model an unlimited number of times<\/strong> until we get a good answer. For example, for \u201cWhat is 2+2?\u201d an untrained, bad language model will output:<\/p>\n<p><em><strong>0, cat, -10, 1928, 3, A, B, 122, 17, 182, 172, A, C, BAHS, %$, #, 9, -192, 12.31 and then suddenly 4.<\/strong><\/em><\/p>\n<p><em><strong>The reward signal is 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0 and then suddenly 1.<\/strong><\/em><\/p>\n<p>So by luck and chance, RL managed to find the correct answer across multiple <strong>\u201crollouts\u201d<\/strong>. Our goal is to see the good answer, 4, more often and the rest (the bad answers) much less often.<\/p>\n<p><strong>So the goal of RL is to be patient: in the limit, if the probability of the correct answer is at least some small number (not zero), it is just a waiting game, and you are 100% certain to encounter the correct answer eventually.<\/strong><\/p>\n<p><strong>That is why I like to call it \u201cluck is all you need\u201d for RL.<\/strong><\/p>\n<p><strong>Well, a better phrase is \u201cpatience is all you need\u201d for RL.<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31797\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/cfe9ee898b4b84b.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"256\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/cfe9ee898b4b84b.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/cfe9ee898b4b84b-18x4.jpg 18w\" 
sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<p>RL essentially gives us a trick: instead of simply waiting forever, we do get \u201cbad signals\u201d, that is, bad answers, and we can use them to \u201cguide\u201d the model so that it tries not to generate bad solutions. This means that even though you waited a long time for a \u201cgood\u201d answer to appear, the model has already changed and will do its best not to output bad answers.<\/p>\n<p>In the \u201cWhat is 2+2?\u201d example:<\/p>\n<p><em><strong>0, cat, -10, 1928, 3, A, B, 122, 17, 182, 172, A, C, BAHS, %$, #, 9, -192, 12.31 and then suddenly 4.<\/strong><\/em><\/p>\n<p>Because we got bad answers, RL influences the model to try not to output them. Over time we are carefully \u201cpruning\u201d or shifting the model's output distribution away from bad answers. So RL is not inefficient: we are not just waiting forever, but actively trying to \u201cpush\u201d the model as far as possible into the \u201ccorrect answer space\u201d.<\/p>\n<p><strong>If the probability is always 0, then RL 
will never work<\/strong>. This is also why people like to start RL from a model that has already been instruction fine-tuned, which can partially and reasonably follow instructions; this most likely raises the probability above 0.<\/p>\n<h2>\ud83e\udda5 What Unsloth offers for RL<\/h2>\n<ul>\n<li>With 15GB of VRAM, Unsloth lets you turn any model of up to 17B parameters, such as Llama 3.1 (8B), Phi-4 (14B), Mistral (7B), or Qwen2.5 (7B), into a reasoning model.<\/li>\n<li><strong>Minimum requirement:<\/strong> just 5GB of VRAM is enough to train your own reasoning model locally (for any model with 1.5B parameters or fewer).<\/li>\n<\/ul>\n<p>\u26a1 <strong>Tutorial: train your own reasoning model with GRPO<\/strong><\/p>\n<h3>GRPO notebooks<\/h3>\n<ul>\n<li><strong>Qwen3 (4B)<\/strong> &#8211; Advanced<\/li>\n<li><strong>DeepSeek-R1-0528-Qwen3-8B<\/strong> <strong>&#8211; New<\/strong><\/li>\n<li>Llama 3.2 (3B) &#8211; Advanced<\/li>\n<li><a href=\"https:\/\/www.kdjingpai.com\/en\/gemma-3-jishubaogao\/\">Gemma 3<\/a> (1B)<\/li>\n<li>Phi-4 (14B)<\/li>\n<li>Qwen2.5 (3B)<\/li>\n<li><a href=\"https:\/\/www.kdjingpai.com\/en\/le-chat-mistral\/\">Mistral<\/a> v0.3 (7B)<\/li>\n<li>Llama 3.1 (8B)<\/li>\n<\/ul>\n<p><strong>New!<\/strong> We now support Dr. 
GRPO and most other new GRPO techniques. You can enable them with the following arguments in <code>GRPOConfig<\/code>:<\/p>\n<pre><code>epsilon=0.2,\r\nepsilon_high=0.28, # one sided\r\ndelta=1.5, # two sided\r\nloss_type='bnpo',\r\n# or: loss_type='grpo',\r\n# or: loss_type='dr_grpo',\r\nmask_truncated_completions=True,\r\n<\/code><\/pre>\n<ul>\n<li>If you are not getting any reasoning, make sure you have enough training steps and that your reward function\/verifier is working. Examples of reward functions are provided separately.<\/li>\n<li>Earlier demonstrations showed that you could achieve your own \u201caha\u201d moment with Qwen2.5 (3B), but it required 2xA100 GPUs (160GB VRAM). Now, with Unsloth, you can achieve the same \u201caha\u201d moment on a single GPU with 5GB of VRAM.<\/li>\n<li>Previously, GRPO only supported full fine-tuning, but we have made it work with QLoRA and LoRA.<\/li>\n<li>For example, at a <strong>20K context length<\/strong> with 8 generations per prompt, Unsloth uses only 54.3GB of VRAM for Llama 3.1 (8B), whereas a standard implementation (+ Flash Attention 2) needs <strong>510.8GB (90% less with Unsloth)<\/strong>.<\/li>\n<li>Note that this is not fine-tuning DeepSeek's R1 distilled models or tuning on R1 
distillation data, both of which Unsloth already supports. This is converting a standard model into a full-fledged reasoning model using GRPO.<\/li>\n<\/ul>\n<p>In one test example, even though we trained Phi-4 with GRPO for only 100 steps, the results were already clear. The model without GRPO has no thinking tokens, while the model trained with GRPO does, and it also gets the correct answer.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31798\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/9c803c3c355d8fb.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"574\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/9c803c3c355d8fb.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/9c803c3c355d8fb-18x9.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<h2>\ud83d\udcbb Training with GRPO<\/h2>\n<p>A separate tutorial covers how to use Unsloth and GRPO to turn any open LLM into a reasoning model.<\/p>\n<h3>How GRPO trains a model<\/h3>\n<ol>\n<li>For each question-answer pair, the model generates multiple possible responses (for example, 8 
variations).<\/li>\n<li>Each response is evaluated with the reward functions.<\/li>\n<li>Training steps:\n<ul>\n<li>If you have 300 rows of data, that is 300 training steps (or 900 steps if you train for 3 epochs).<\/li>\n<li>You can increase the number of responses generated per question (for example, from 8 to 16).<\/li>\n<\/ul>\n<\/li>\n<li>The model learns by updating its weights at every step.<\/li>\n<\/ol>\n<p>If your GRPO model is not learning, we strongly recommend using our Advanced GRPO notebooks, which have a much better reward function, so you should see results faster and more often.<\/p>\n<h3>Basics\/Tips<\/h3>\n<ul>\n<li>Wait at least <strong>300 steps<\/strong> before expecting the reward to actually increase. To get decent results you may need to train for at least 12 hours (this is simply how GRPO works), but this is not compulsory and you can stop at any time.<\/li>\n<li>For best results, have at least <strong>500 rows of data<\/strong>. You can even try with 10 
rows of data, but more is better.<\/li>\n<li>Every training run differs depending on your model, data, reward function\/verifier, and so on. So although we state a minimum of 300 steps, you may sometimes need 1000 steps or more.<\/li>\n<li>If you are running GRPO with Unsloth locally and hit an error, also run <code>pip install diffusers<\/code>. Use the latest version of vLLM as well.<\/li>\n<li>We recommend applying GRPO to a model of at least <strong>1.5B parameters<\/strong> so that it generates thinking tokens correctly; smaller models may not.<\/li>\n<li>For GRPO's <strong>GPU VRAM requirements<\/strong> <strong>(QLoRA 4-bit)<\/strong>, the general rule is that the model's parameter count in billions is roughly the number of GB of VRAM you will need (you can get by with less, but this is the safe estimate). The longer the context length you set, the more VRAM you need. LoRA 16-bit uses at least 4x more VRAM.<\/li>\n<li><strong>Continuous fine-tuning<\/strong> is possible, and you can leave GRPO running in the background.<\/li>\n<li>The example notebooks use the <strong>GSM8K dataset<\/strong>, which is currently the most popular choice for R1 
style training.<\/li>\n<li>If you are using a base model, make sure you have a chat template.<\/li>\n<li>The more you train with GRPO, the better. The best part of GRPO is that you do not even need much data. All you need is a great reward function\/verifier, and the longer you train, the better your model becomes. Expect your reward to rise with the number of steps over time, as in the figure below:<\/li>\n<\/ul>\n<ul>\n<li><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-31799\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/b86366c873b5510.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"316\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/b86366c873b5510.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/b86366c873b5510-18x5.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/>Training-loss tracking for GRPO is now built directly into Unsloth, so external tools such as wandb are not needed. It includes full logging details for every reward function, including the total aggregated reward.<\/li>\n<\/ul>\n<p><img loading=\"lazy\" 
decoding=\"async\" class=\"aligncenter size-full wp-image-31800\" title=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/990aa8f91118228.jpg\" alt=\"\u4ece\u5165\u95e8\u5230\u7cbe\u901a\uff1a\u6df1\u5165\u89e3\u6790\u5f3a\u5316\u5b66\u4e60\u4e0e GRPO \u6a21\u578b\u8bad\u7ec3-1\" width=\"1200\" height=\"454\" srcset=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/990aa8f91118228.jpg 1200w, https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/06\/990aa8f91118228-18x7.jpg 18w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<h2>\ud83d\udccb Reward Functions \/ Verifiers<\/h2>\n<p>In reinforcement learning, a <strong>reward function<\/strong> and a <strong>verifier<\/strong> play different roles in evaluating a model's output. In practice you can treat them as the same thing: technically they differ, but the distinction rarely matters because they are usually used together.<\/p>\n<p><strong>Verifier<\/strong>:<\/p>\n<ul>\n<li>Determines whether a generated response is correct or incorrect.<\/li>\n<li>It does not assign a numerical score; it only verifies correctness.<\/li>\n<li><strong>Example:<\/strong> If a model generates \"5\" for \"2+2\", the verifier checks it and marks it as \"wrong\" (since the correct answer is 
4).<\/li>\n<li>A verifier can also execute code (for example, in Python) to validate logic, syntax, and correctness without human evaluation.<\/li>\n<\/ul>\n<p><strong>Reward function<\/strong>:<\/p>\n<ul>\n<li>Converts verification results (or other criteria) into a numerical score.<\/li>\n<li><strong>Example:<\/strong> If an answer is wrong, it might assign a penalty (-1, -2, etc.), while a correct answer might receive a positive score (+1, +2).<\/li>\n<li>It can also penalize on criteria other than correctness, such as excessive length or poor readability.<\/li>\n<\/ul>\n<p><strong>Key differences<\/strong>:<\/p>\n<ul>\n<li>A <strong>verifier<\/strong> checks correctness but does not score.<\/li>\n<li>A <strong>reward function<\/strong> assigns scores but does not necessarily verify correctness itself.<\/li>\n<li>A reward function <em>can<\/em> use a verifier, but the two are technically not the same.<\/li>\n<\/ul>\n<h3>Understanding Reward Functions<\/h3>\n<p>GRPO 
aims primarily to maximize reward and to learn how an answer was derived, rather than simply memorizing and reproducing responses from its training data.<\/p>\n<ul>\n<li>At each training step, GRPO <strong>adjusts the model weights<\/strong> to maximize reward. This process fine-tunes the model incrementally.<\/li>\n<li><strong>Regular fine-tuning<\/strong> (without GRPO) only <strong>maximizes next-token prediction probability<\/strong> and does not optimize for a reward. GRPO <strong>optimizes for a reward function<\/strong> rather than just predicting the next token.<\/li>\n<li>You can <strong>reuse data<\/strong> across multiple epochs.<\/li>\n<li><strong>Default reward functions<\/strong> can be predefined for a wide range of use cases, or you can ask ChatGPT or a local model to generate them for you.<\/li>\n<li>There is no single correct way to design reward functions or verifiers; the possibilities are endless. They must, however, be well designed and meaningful, because a poorly designed reward can unintentionally degrade model performance.<\/li>\n<\/ul>\n<h3>\ud83e\ude99 Reward Function Examples<\/h3>\n<p>You can refer to the examples below. You can feed your generations into an LLM such as <a 
href=\"https:\/\/www.kdjingpai.com\/en\/chatgpt-6\/\">ChatGPT<\/a> 4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate them. For example, feed your generations into the LLM of your choice and set a rule: \"If the answer sounds too robotic, deduct 3 points.\" This helps refine outputs against quality criteria.<\/p>\n<h4>Example #1: Simple Arithmetic Task<\/h4>\n<ul>\n<li><strong>Question:<\/strong> <code>\"2 + 2\"<\/code><\/li>\n<li><strong>Answer:<\/strong> <code>\"4\"<\/code><\/li>\n<li><strong>Reward function 1:<\/strong>\n<ul>\n<li>If a number is detected \u2192 <strong>+1<\/strong><\/li>\n<li>If no number is detected \u2192 <strong>-1<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Reward function 2:<\/strong>\n<ul>\n<li>If the number matches the correct answer \u2192 <strong>+3<\/strong><\/li>\n<li>If it is incorrect \u2192 <strong>-3<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Total reward:<\/strong> <em>the sum of all reward functions<\/em><\/li>\n<\/ul>\n<h4>Example #2: Email Automation Task<\/h4>\n<ul>\n<li><strong>Question:<\/strong> Inbound email<\/li>\n<li><strong>Answer:<\/strong> Outbound email<\/li>\n<li><strong>Reward functions:<\/strong>\n<ul>\n<li>If the answer contains a required keyword 
\u2192 <strong>+1<\/strong><\/li>\n<li>If the answer exactly matches the ideal response \u2192 <strong>+1<\/strong><\/li>\n<li>If the response is too long \u2192 <strong>-1<\/strong><\/li>\n<li>If the recipient's name is included \u2192 <strong>+1<\/strong><\/li>\n<li>If a signature block (phone, email, address) is present \u2192 <strong>+1<\/strong><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Unsloth Proximity-Based Reward Function<\/h3>\n<p>If you have looked at our <strong>Advanced GRPO Colab notebook<\/strong>, you will have noticed that we created a <strong>custom proximity-based reward function<\/strong>, built entirely from scratch, that rewards answers the closer they are to the correct one. This flexible function can be applied to a wide range of tasks.<\/p>\n<ul>\n<li>In our example, we enable reasoning in <a href=\"https:\/\/www.kdjingpai.com\/en\/qwen3-fabushenba\/\">Qwen3<\/a> (Base) and guide it toward specific tasks.<\/li>\n<li>Apply a pre-fine-tuning strategy to counter GRPO's default tendency to learn only the format.<\/li>\n<li>Use regex-based matching to improve evaluation accuracy.<\/li>\n<li>Go beyond generic prompts (such as <code>think<\/code>) by creating custom GRPO 
templates, for example <code>&lt;start_working_out&gt;&lt;\/end_working_out&gt;<\/code>.<\/li>\n<li>Apply proximity-based scoring: the model earns more reward for closer answers (for example, predicting 9 when the correct answer is 10 scores better than predicting 3), while outliers are penalized.<\/li>\n<\/ul>\n<h4>GSM8K Reward Functions<\/h4>\n<p>In our other examples we use the existing GSM8K reward functions by @willccbb, which are popular and have proven quite effective:<\/p>\n<ul>\n<li><strong>correctness_reward_func<\/strong> \u2013 Rewards answers that exactly match the label.<\/li>\n<li><strong>int_reward_func<\/strong> \u2013 Encourages integer-only answers.<\/li>\n<li><strong>soft_format_reward_func<\/strong> \u2013 Checks structure but tolerates minor newline mismatches.<\/li>\n<li><strong>strict_format_reward_func<\/strong> \u2013 Ensures the response structure matches the prompt, including newlines.<\/li>\n<li><strong>xmlcount_reward_func<\/strong> \u2013 Ensures each XML tag appears exactly once in the response.<\/li>\n<\/ul>\n<h2>\ud83e\uddee Using vLLM<\/h2>\n<p>You can now use vLLM directly in your fine-tuning stack, which gives much higher throughput and lets you fine-tune and run inference on the model at the same time! On 1x A100 40GB, using Unsloth's Llama 3.2 3B Instruct 
dynamic 4-bit quant, expect around 4000 <a href=\"https:\/\/www.kdjingpai.com\/en\/tokenization\/\">tokens<\/a> per second. On a 16GB Tesla T4 (the free Colab GPU), you can reach 300 tokens per second.<\/p>\n<p>We have also eliminated the doubled memory usage that comes from loading <a href=\"https:\/\/www.kdjingpai.com\/en\/vllm\/\">vLLM<\/a> and Unsloth together, saving about 5GB for Llama 3.1 8B and 3GB for Llama 3.2 3B. Unsloth could already fine-tune Llama 3.3 70B Instruct on a single 48GB GPU, with the Llama 3.3 70B weights taking 40GB of VRAM. Without removing the doubled memory usage, loading Unsloth and vLLM together would need &gt;= 80GB of VRAM.<\/p>\n<p>With Unsloth, however, you can still fine-tune and get the benefits of fast inference in a single package in under 48GB of VRAM! To use fast inference, first install vllm, then instantiate Unsloth with <code>fast_inference<\/code>:<\/p>\n<pre><code># Run in a shell first: pip install unsloth vllm\r\nfrom unsloth import FastLanguageModel\r\n\r\n# Load the model with vLLM-backed fast inference enabled\r\nmodel, tokenizer = FastLanguageModel.from_pretrained(\r\n    model_name = \"unsloth\/Llama-3.2-3B-Instruct\",\r\n    fast_inference = True,\r\n)\r\nmodel.fast_generate([\"Hello!\"])\r\n<\/code><\/pre>\n<h2>\u2705 GRPO Requirements Guide<\/h2>\n<p>When you use Unsloth for GRPO, compared with standard implementations that use Flash Attention 2, we use several tricks to intelligently reduce VRAM 
usage by roughly 90%! For example, at a 20K context length with 8 generations per prompt, Unsloth uses only <strong>54.3GB of VRAM<\/strong> for Llama 3.1 8B, whereas standard implementations need <strong>510.8GB (about 90% less with Unsloth)<\/strong>.<\/p>\n<ol>\n<li>For GRPO's <strong>GPU VRAM requirements with QLoRA 4-bit<\/strong>, the rule of thumb is that the model's parameter count in billions roughly equals the GB of VRAM you will need (you can get by with less; this is the safe estimate). The longer the context length you set, the more VRAM you need. 16-bit LoRA uses at least 4x more VRAM.<\/li>\n<li>Our new memory-efficient linear kernels for GRPO cut memory usage by 8x or more. This saves 68.5GB of memory while actually being faster with the help of <code>torch.compile<\/code>!<\/li>\n<li>We leverage the smart Unsloth gradient checkpointing algorithm that we released a while ago. It offloads intermediate activations to system RAM asynchronously while being only 1% slower. This saves 52GB of memory.<\/li>\n<li>Unlike implementations in other packages, Unsloth also shares the same GPU \/ CUDA memory space with the underlying inference engine (vLLM). This saves 16GB of memory.<\/li>\n<\/ol>\n<table>\n<thead>\n<tr>\n<th 
align=\"left\">Metric<\/th>\n<th align=\"left\">Unsloth<\/th>\n<th align=\"left\">Standard + FA2<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td align=\"left\"><strong>Training memory cost<\/strong><\/td>\n<td align=\"left\">42GB<\/td>\n<td align=\"left\">414GB<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>GRPO memory cost<\/strong><\/td>\n<td align=\"left\">9.8GB<\/td>\n<td align=\"left\">78.3GB<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Inference cost<\/strong><\/td>\n<td align=\"left\">0GB<\/td>\n<td align=\"left\">16GB<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Inference KV cache at 20K context length<\/strong><\/td>\n<td align=\"left\">2.5GB<\/td>\n<td align=\"left\">2.5GB<\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><strong>Total memory usage<\/strong><\/td>\n<td align=\"left\"><strong>54.3GB (about 90% less)<\/strong><\/td>\n<td align=\"left\"><strong>510.8GB<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In a typical standard GRPO implementation, you need to create 2 logits of size (8, 20K), each spanning the full vocabulary, to compute the GRPO loss. In VRAM this takes <code>2 * 2 bytes * 8 (number of generations) * 20K (context length) * 128256 (vocabulary size) = 78.3GB<\/code> (with 20K taken as 20,480 tokens and 1GB as 2^30 bytes).<\/p>\n<p>Unsloth cuts memory usage for long-context GRPO by 8x, so at a 20K context length we need only an extra 9.8GB of VRAM!<\/p>\n<p>We also need to hold the KV cache in 16-bit. Llama 3.1 8B has 32 layers, and K and V are each of size 1024. So the memory usage at a 20K context length = <code>2 * 2 bytes * 32 
layers * 20K context length * 1024 = 2.5GB per batch<\/code>. We would set vLLM's batch size to 8, but to save VRAM we leave it at 1 in our calculation; otherwise you would need 20GB for the KV cache.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn everything about reinforcement learning (RL) and how to train your own DeepSeek-R1 reasoning model with Unsloth and GRPO. A complete guide from beginner to advanced. \ud83e\udda5 What you will learn: What is RL? RLVR? PPO? GRPO? RLHF? RFT?&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[34],"tags":[],"class_list":["post-31788","post","type-post","status-publish","format-standard","hentry","category-knowledge"],"_links":{"self":[{"href":"https:\/\/www.kdjingpai.com\/ja\/wp-json\/wp\/v2\/posts\/31788","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kdjingpai.com\/ja\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kdjingpai.com\/ja\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/ja\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/ja\/wp-json\/wp\/v2\/comments?post=31788"}],"version-history":[{"count":0,"href":"https:\/\/www.kdjingpai.com\/ja\/wp-json\/wp\/v2\/posts\/31788\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.kdjingpai.com\/ja\/wp-json\/wp\/v2\/media?parent=31788"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/
\/www.kdjingpai.com\/ja\/wp-json\/wp\/v2\/categories?post=31788"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/ja\/wp-json\/wp\/v2\/tags?post=31788"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}