{"id":30219,"date":"2025-04-09T15:02:27","date_gmt":"2025-04-09T07:02:27","guid":{"rendered":"https:\/\/www.aisharenet.com\/?p=30219"},"modified":"2025-04-09T15:02:50","modified_gmt":"2025-04-09T07:02:50","slug":"jingtong-crawl4aiba","status":"publish","type":"post","link":"https:\/\/www.kdjingpai.com\/pt\/jingtong-crawl4aiba\/","title":{"rendered":"Mastering Crawl4AI: Preparing High-Quality Web Data for LLMs and RAG"},"content":{"rendered":"<p>Traditional web crawling frameworks are versatile, but their output often needs additional cleaning and formatting, which makes integrating them with large language models (LLMs) relatively complex. The output of many tools (such as raw <code>HTML<\/code> or unstructured <code>JSON<\/code>) contains a lot of noise and is poorly suited to direct use in scenarios such as retrieval-augmented generation (RAG), because that noise reduces the efficiency and accuracy of <code>LLM<\/code> processing.<\/p>\n<p><a 
href=\"https:\/\/www.kdjingpai.com\/crawl4ai\/\">Crawl4AI<\/a> takes a different approach. It focuses on directly producing clean, structured <code>Markdown<\/code> content. This format preserves the semantic structure of the original page (headings, lists, code blocks) while intelligently stripping out irrelevant elements such as navigation, ads, and footers, which makes it well suited as input for an <code>LLM<\/code> or for building high-quality <code><a href=\"https:\/\/www.kdjingpai.com\/rag\/\">RAG<\/a><\/code> datasets. <code><a href=\"https:\/\/www.kdjingpai.com\/crawl4ai\/\">Crawl4AI<\/a><\/code> is a fully open-source project that requires no <code>API<\/code> key and has no paywall.<\/p>\n<h2>Installation and Setup<\/h2>\n<p>It is recommended to use <a href=\"https:\/\/www.kdjingpai.com\/uvminglingxiangjie\/\">uv<\/a> to create and activate a dedicated <code>Python<\/code> virtual environment for managing project dependencies. <code>uv<\/code> is an emerging <code>Python<\/code> package manager written in <code>Rust<\/code> that has drawn attention for its significant speed advantage (typically 3-5 times faster than <code>pip<\/code>) and its efficient parallel dependency resolution.<\/p>\n<pre><code># Create the virtual environment\r\nuv venv crawl4ai-env\r\n# Activate the environment\r\n# Windows\r\n# 
crawl4ai-env\\Scripts\\activate\r\n# macOS\/Linux\r\nsource crawl4ai-env\/bin\/activate\r\n<\/code><\/pre>\n<p>Once the environment is active, use <code>uv<\/code> to install the core <code>Crawl4AI<\/code> library:<\/p>\n<pre><code>uv pip install crawl4ai\r\n<\/code><\/pre>\n<p>After installation, run the setup command, which installs or updates the browser drivers that <code>Playwright<\/code> needs (such as <code>Chromium<\/code>) and runs an environment check. <code>Playwright<\/code> is a browser automation library developed by <code>Microsoft<\/code>; <code>Crawl4AI<\/code> uses it to simulate real user interaction, which lets it handle <code>JavaScript<\/code>-heavy sites that load content dynamically.<\/p>\n<pre><code>crawl4ai-setup\r\n<\/code><\/pre>\n<p>If you run into browser driver problems, you can try installing manually:<\/p>\n<pre><code># Manually install the Playwright browser and its dependencies\r\npython -m playwright install --with-deps chromium\r\n<\/code><\/pre>\n<p>Depending on your needs, you can use <code>uv<\/code> to install extras that provide additional features:<\/p>\n<pre><code># Install text clustering support (depends on PyTorch)\r\nuv pip install \"crawl4ai[torch]\"\r\n# Install Transformers support (for local AI models)\r\nuv pip install \"crawl4ai[transformer]\"\r\n# Install all optional features\r\nuv pip install \"crawl4ai[all]\"\r\n<\/code><\/pre>\n<h2>Basic Crawling Example<\/h2>\n<p>The following <code>Python<\/code> script shows the basic usage of <code>Crawl4AI<\/code>: crawl a single web page and convert it to <code>Markdown<\/code>.<\/p>\n<pre><code>import asyncio\r\nfrom crawl4ai import AsyncWebCrawler\r\nasync def main():\r\n    # Initialize the asynchronous crawler\r\n    async with AsyncWebCrawler() as crawler:\r\n        # Run the crawl\r\n        result = await crawler.arun(\r\n            url=\"https:\/\/www.sitepoint.com\/react-router-complete-guide\/\"\r\n        )\r\n        # Check whether the crawl succeeded\r\n        if result.success:\r\n            # Print information about the result\r\n            print(f\"Title: {result.title}\")\r\n            print(f\"Extracted Markdown ({len(result.markdown)} characters):\")\r\n            # Show only the first 300 characters as a preview\r\n            print(result.markdown[:300] + \"...\")\r\n            # Save the full Markdown content to a file\r\n            with open(\"example_content.md\", \"w\", encoding=\"utf-8\") as f:\r\n                f.write(result.markdown)\r\n            print(\"Content saved to example_content.md\")\r\n        else:\r\n            # Print the error details\r\n            print(f\"Crawl failed: {result.url}\")\r\n            print(f\"Status code: {result.status_code}\")\r\n            print(f\"Error message: {result.error_message}\")\r\nif __name__ == \"__main__\":\r\n    asyncio.run(main())\r\n<\/code><\/pre>\n<p>When this script runs, <code>Crawl4AI<\/code> launches a <code>Playwright<\/code>-controlled browser to visit the specified <code>URL<\/code>, executes the page's <code>JavaScript<\/code>, then intelligently identifies and extracts the main content area, filters out distracting elements, and finally produces a clean <code>Markdown<\/code> file.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.kdjingpai.com\/wp-content\/uploads\/2025\/04\/c98a8e1916931d6.png\" alt=\"e5d330f6-8e23-4a57-9739-bb4be1a5323d.png\" \/><\/p>\n<h2>Batch and Parallel Crawling<\/h2>\n<p>When you need to process multiple <code>URL<\/code>s, <code>Crawl4AI<\/code>'s parallel processing can greatly improve throughput. The <code>concurrency<\/code> parameter in <code>CrawlerRunConfig<\/code> controls how many pages are processed at the same time.<\/p>\n<pre><code>import asyncio\r\nfrom crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode\r\nasync def main():\r\n    urls = [\r\n        \"https:\/\/example.com\/page1\",\r\n        \"https:\/\/example.com\/page2\",\r\n        \"https:\/\/example.com\/page3\",\r\n        # Add more URLs...\r\n    ]\r\n    # Browser configuration: headless mode, longer timeout\r\n    browser_config = BrowserConfig(\r\n        headless=True,\r\n        timeout=45000, # 45-second timeout\r\n    )\r\n    # Crawl run configuration: set the concurrency and bypass the cache to get fresh content\r\n    run_config = CrawlerRunConfig(\r\n        concurrency=5,  # Process 5 pages at the same time\r\n        cache_mode=CacheMode.BYPASS # Disable caching\r\n    )\r\n    results = []\r\n    async with AsyncWebCrawler(browser_config=browser_config) as crawler:\r\n        # Use arun_many for batch parallel crawling\r\n        # Note: arun_many expects a list of run_config objects passed to the configs parameter\r\n        # If every URL uses the same configuration, build a list of configurations\r\n        configs = [run_config.clone(url=url) for url in urls] # Clone the config for each URL and set the URL\r\n        # arun_many returns an async generator\r\n        async for result in crawler.arun_many(configs=configs):\r\n            if result.success:\r\n                results.append(result)\r\n                print(f\"Done: {result.url}, got {len(result.markdown)} characters\")\r\n            else:\r\n                print(f\"Failed: {result.url}, error: {result.error_message}\")\r\n    # Merge all successful results into a single file\r\n    with open(\"combined_results.md\", \"w\", encoding=\"utf-8\") as f:\r\n        for result in results:\r\n            f.write(f\"## {result.title}\\n\\n\")\r\n            f.write(result.markdown)\r\n            f.write(\"\\n\\n---\\n\\n\")\r\n    print(\"All successful content has been merged and saved to combined_results.md\")\r\nif __name__ == \"__main__\":\r\n    asyncio.run(main())\r\n<\/code><\/pre>\n<p><strong>Note<\/strong>: The code above uses the <code>arun_many<\/code> method, which is the recommended way to process large URL 
lists and is more efficient than calling <code>arun<\/code> in a loop. <code>arun_many<\/code> takes a list of configurations, one per <code>URL<\/code>. If all <code>URL<\/code>s share the same base configuration, you can use the <code>clone()<\/code> method to create copies and set the specific <code>URL<\/code> on each.<\/p>\n<h2>Structured Data Extraction (Selector-Based)<\/h2>\n<p>Besides <code>Markdown<\/code>, <code>Crawl4AI<\/code> can also extract structured data using <code>CSS<\/code> selectors or <code>XPath<\/code>, which works very well for sites whose data follows a regular layout.<\/p>\n<pre><code>import asyncio\r\nimport json\r\nfrom crawl4ai import AsyncWebCrawler, ExtractorConfig\r\nasync def main():\r\n    # Define the extraction rules (CSS selectors)\r\n    extractor_config = ExtractorConfig(\r\n        strategy=\"css\", # Explicitly set the strategy to CSS\r\n        rules={\r\n            \"products\": {\r\n                \"selector\": \"div.product-card\", # Main selector\r\n                \"type\": \"list\",\r\n                \"properties\": {\r\n                    \"name\": {\"selector\": \"h2.product-title\", \"type\": \"text\"},\r\n                    \"price\": {\"selector\": \".price span\", \"type\": \"text\"},\r\n                    \"link\": {\"selector\": \"a.product-link\", \"type\": \"attribute\", \"attribute\": \"href\"}\r\n                }\r\n            }\r\n        }\r\n    )\r\n    async with AsyncWebCrawler() as crawler:\r\n        result = await crawler.arun(\r\n            url=\"https:\/\/example-shop.com\/products\",\r\n            extractor_config=extractor_config\r\n        )\r\n        if result.success and result.extracted_data:\r\n            extracted_data = result.extracted_data\r\n            with open(\"products.json\", \"w\", encoding=\"utf-8\") as f:\r\n                json.dump(extracted_data, f, ensure_ascii=False, indent=2)\r\n            print(f\"Extracted {len(extracted_data.get('products', []))} products\")\r\n            print(\"Data saved to products.json\")\r\n        elif not result.success:\r\n            print(f\"Crawl failed: {result.error_message}\")\r\n        else:\r\n            print(\"No data was extracted, or the extraction rules did not match\")\r\nif __name__ == \"__main__\":\r\n    asyncio.run(main())\r\n<\/code><\/pre>\n<p>This approach needs no <code>LLM<\/code>, so it is cheap and fast, and it fits scenarios where the target elements are well defined.<\/p>\n<h2>AI-Enhanced Data Extraction<\/h2>\n<p>For pages with a complex structure or no fixed pattern, you can use an <code>LLM<\/code> to perform intelligent extraction.<\/p>\n<pre><code>import asyncio\r\nimport json\r\nfrom crawl4ai import AsyncWebCrawler, BrowserConfig, AIExtractorConfig\r\nasync def main():\r\n    # Configure the AI extractor\r\n    ai_config = AIExtractorConfig(\r\n        provider=\"openai\", # or \"local\", \"anthropic\", etc.\r\n        model=\"gpt-4o-mini\", # Use an OpenAI model\r\n        # api_key=\"YOUR_OPENAI_API_KEY\", # Provide it here if the environment variable is not set\r\n        schema={\r\n            \"type\": \"object\",\r\n            \"properties\": {\r\n                \"article_summary\": {\"type\": \"string\", \"description\": \"A brief summary of the article.\"},\r\n                \"key_topics\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}, \"description\": \"List of main topics discussed.\"},\r\n                \"sentiment\": {\"type\": \"string\", \"enum\": [\"positive\", \"negative\", \"neutral\"], \"description\": \"Overall sentiment of the article.\"}\r\n            },\r\n            \"required\": [\"article_summary\", \"key_topics\"]\r\n        },\r\n        instruction=\"Extract the summary, key topics, and sentiment from the provided article text.\"\r\n    )\r\n    browser_config = BrowserConfig(timeout=60000) # AI processing may need more time\r\n    async with AsyncWebCrawler(browser_config=browser_config) as crawler:\r\n        result = await crawler.arun(\r\n            url=\"https:\/\/example-news.com\/article\/complex-analysis\",\r\n            ai_extractor_config=ai_config\r\n        )\r\n        if result.success and result.ai_extracted:\r\n            ai_extracted = result.ai_extracted\r\n            print(\"AI-extracted data:\")\r\n            print(json.dumps(ai_extracted, indent=2, ensure_ascii=False))\r\n            # Optionally save it to a file as well\r\n            # with open(\"ai_extracted_data.json\", \"w\", encoding=\"utf-8\") as f:\r\n            #     json.dump(ai_extracted, f, ensure_ascii=False, indent=2)\r\n        elif not result.success:\r\n            print(f\"Crawl failed: {result.error_message}\")\r\n        else:\r\n            print(\"The AI could not extract the requested data.\")\r\nif __name__ == \"__main__\":\r\n    asyncio.run(main())\r\n<\/code><\/pre>\n<p>AI extraction offers a great deal of flexibility: it can understand the content and produce structured output on demand. The trade-off is extra <code>API<\/code> call cost (when using a cloud-hosted <code>LLM<\/code>) and longer processing time. Choosing a local model (such as <code><a 
href=\"https:\/\/www.kdjingpai.com\/le-chat-mistral\/\">Mistral<\/a><\/code>, <code>Llama<\/code>) can reduce cost and protect privacy, but it places certain demands on local hardware.<\/p>\n<h2>Advanced Configuration and Tips<\/h2>\n<p><code>Crawl4AI<\/code> offers a rich set of configuration options for handling complex scenarios.<\/p>\n<h3>Browser Configuration (<code>BrowserConfig<\/code>)<\/h3>\n<p><code>BrowserConfig<\/code> controls how the browser itself is launched and how it behaves.<\/p>\n<pre><code>from crawl4ai import BrowserConfig\r\nconfig = BrowserConfig(\r\n    browser_type=\"firefox\",  # Use the Firefox browser\r\n    headless=False,         # Show the browser window, handy for debugging\r\n    user_agent=\"MyCustomCrawler\/1.0\", # Set a custom User-Agent\r\n    proxy_config={          # Configure a proxy server\r\n        \"server\": \"http:\/\/proxy.example.com:8080\",\r\n        \"username\": \"proxy_user\",\r\n        \"password\": \"proxy_password\"\r\n    },\r\n    ignore_https_errors=True, # Ignore HTTPS certificate errors (common in development environments)\r\n    use_persistent_context=True, # Enable a persistent context\r\n    user_data_dir=\".\/my_browser_profile\", # User data directory for saving cookies, local storage, etc.\r\n    timeout=60000,          # Global timeout for browser operations (milliseconds)\r\n    verbose=True            # Print more detailed logs\r\n)\r\n# Pass it in when initializing AsyncWebCrawler\r\n# async with AsyncWebCrawler(browser_config=config) as crawler:\r\n#    ...\r\n<\/code><\/pre>\n<h3>Crawl Run Configuration 
(<code>CrawlerRunConfig<\/code>)<\/h3>\n<p><code>CrawlerRunConfig<\/code> controls the specific behavior of a single <code>arun()<\/code> or <code>arun_many()<\/code> call.<\/p>\n<pre><code>from crawl4ai import CrawlerRunConfig, CacheMode\r\nrun_config = CrawlerRunConfig(\r\n    cache_mode=CacheMode.READ_ONLY, # Read from the cache only; do not write new entries\r\n    check_robots_txt=True,      # Check and obey robots.txt rules\r\n    wait_until=\"networkidle\",   # Wait for the network to go idle before extracting; suits JS-loaded content\r\n    wait_for=\"css:div#final-content\", # Wait for an element matching this CSS selector to appear\r\n    js_code=\"window.scrollTo(0, document.body.scrollHeight);\", # Run JS after the page loads (e.g. scroll to the bottom to trigger loading)\r\n    scan_full_page=True,        # Try to auto-scroll the page to load all content (for infinite scroll)\r\n    screenshot=True,            # Capture a page screenshot (in result.screenshot, Base64-encoded)\r\n    pdf=True,                   # Generate a PDF of the page (in result.pdf, Base64-encoded)\r\n    word_count_threshold=50,    # Filter out text blocks with fewer than 50 words\r\n    excluded_tags=[\"header\", \"nav\", \"footer\", \"aside\"], # Exclude specific HTML tags from the Markdown\r\n    exclude_external_links=True # Do not extract external links\r\n)\r\n# Pass it in when calling arun(), or when building the config list for arun_many()\r\n# result = await crawler.arun(url=\"...\", config=run_config)\r\n<\/code><\/pre>\n<h3>Handling JavaScript 
and Dynamic Content<\/h3>\n<p>Thanks to <code>Playwright<\/code>, <code>Crawl4AI<\/code> handles sites that depend on <code>JavaScript<\/code> rendering well. The key settings are:<\/p>\n<ul>\n<li><code>wait_until<\/code>: Setting it to <code>\"networkidle\"<\/code> or <code>\"load\"<\/code> is usually a better fit for dynamic pages than the default <code>\"domcontentloaded\"<\/code>.<\/li>\n<li><code>wait_for<\/code>: Waits until a specific element appears or a <code>JavaScript<\/code> condition is satisfied.<\/li>\n<li><code>js_code<\/code>: Runs custom <code>JavaScript<\/code> after the page loads, for example to click a button or scroll the page.<\/li>\n<li><code>scan_full_page<\/code>: Automatically handles common infinite-scroll pages.<\/li>\n<li><code>delay_before_return_html<\/code>: Adds a short delay before extraction so that all scripts have time to finish running.<\/li>\n<\/ul>\n<h3>Error Handling and Debugging<\/h3>\n<ul>\n<li><strong>Check <code>result.success<\/code><\/strong>: Always check this attribute after every crawl.<\/li>\n<li><strong>Inspect <code>result.status_code<\/code> and <code>result.error_message<\/code><\/strong>: They tell you why a crawl failed.<\/li>\n<li><strong>Set <code>headless=False<\/code><\/strong>: Set this in <code>BrowserConfig<\/code> to watch the browser as it works and diagnose problems visually.<\/li>\n<li><strong>Enable <code>verbose=True<\/code><\/strong>: 
Set this in <code>BrowserConfig<\/code> to get more detailed runtime logs.<\/li>\n<li><strong>Use <code>try...except<\/code><\/strong>: Wrap <code>arun()<\/code> or <code>arun_many()<\/code> calls to catch any <code>Python<\/code> exceptions that may be raised.<\/li>\n<\/ul>\n<pre><code>import asyncio\r\nfrom crawl4ai import AsyncWebCrawler, BrowserConfig\r\nasync def debug_crawl():\r\n    # Enable debug mode: show the browser and print verbose logs\r\n    debug_browser_config = BrowserConfig(headless=False, verbose=True)\r\n    async with AsyncWebCrawler(browser_config=debug_browser_config) as crawler:\r\n        try:\r\n            result = await crawler.arun(url=\"https:\/\/problematic-site.com\")\r\n            if not result.success:\r\n                print(f\"Crawl failed: {result.error_message} (Status: {result.status_code})\")\r\n            else:\r\n                print(\"Crawl successful.\")\r\n                # ... process result ...\r\n        except Exception as e:\r\n            print(f\"An unexpected error occurred: {e}\")\r\nif __name__ == \"__main__\":\r\n    asyncio.run(debug_crawl())\r\n<\/code><\/pre>\n<h3>Respecting <code>robots.txt<\/code><\/h3>\n<p>When crawling the web, respecting a site's <code>robots.txt<\/code> file is basic etiquette, and it also reduces the risk of having your IP banned. <code>Crawl4AI<\/code> can handle this automatically.<\/p>\n<p>Set <code>check_robots_txt=True<\/code> in <code>CrawlerRunConfig<\/code>:<\/p>\n<pre><code>respectful_config = CrawlerRunConfig(\r\n    check_robots_txt=True\r\n)\r\n# result = await crawler.arun(url=\"https:\/\/example.com\", config=respectful_config)\r\n# if not result.success and result.status_code == 403:\r\n#    print(\"Access denied by robots.txt\")\r\n<\/code><\/pre>\n<p><code>Crawl4AI<\/code> automatically downloads, caches, and parses the <code>robots.txt<\/code> file. If the rules forbid access to the target <code>URL<\/code>, <code>arun()<\/code> fails: <code>result.success<\/code> is <code>False<\/code>, <code>status_code<\/code> is usually 403, and a corresponding error message is attached.<\/p>\n<h3>Sessions (<code>Session 
Management<\/code>)<\/h3>\n<p>For multi-step operations that require a login or persistent state (such as form submission or paginated navigation), you can use session management. By specifying the same <code>session_id<\/code> in <code>CrawlerRunConfig<\/code>, multiple <code>arun()<\/code> calls can reuse the same browser page instance, preserving <code>cookies<\/code> and <code>JavaScript<\/code> state.<\/p>\n<pre><code>import asyncio\r\nfrom crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode\r\nasync def session_example():\r\n    async with AsyncWebCrawler() as crawler:\r\n        session_id = \"my_unique_session\"\r\n        # Step 1: Load login page (hypothetical)\r\n        login_config = CrawlerRunConfig(session_id=session_id, cache_mode=CacheMode.BYPASS)\r\n        await crawler.arun(url=\"https:\/\/example.com\/login\", config=login_config)\r\n        print(\"Login page loaded.\")\r\n        # Step 2: Execute JS to fill and submit login form (hypothetical)\r\n        login_js = \"\"\"\r\n        document.getElementById('username').value = 'user';\r\n        document.getElementById('password').value = 'pass';\r\n        document.getElementById('loginButton').click();\r\n        \"\"\"\r\n        submit_config = CrawlerRunConfig(\r\n            session_id=session_id,\r\n            js_code=login_js,\r\n            js_only=True, # Run the JS only; do not reload the page\r\n            wait_until=\"networkidle\" # Wait for the post-login redirect to finish\r\n        )\r\n        await crawler.arun(config=submit_config) # No URL needed; the JS runs on the current page\r\n        print(\"Login submitted.\")\r\n        # Step 3: Crawl a protected page within the same session\r\n        protected_config = CrawlerRunConfig(session_id=session_id, cache_mode=CacheMode.BYPASS)\r\n        result = await crawler.arun(url=\"https:\/\/example.com\/dashboard\", config=protected_config)\r\n        if result.success:\r\n            print(\"Successfully crawled protected page:\")\r\n            print(result.markdown[:200] + \"...\")\r\n        else:\r\n            print(f\"Failed to crawl protected page: {result.error_message}\")\r\n        # Clean up the session (optional, but recommended)\r\n        # await crawler.crawler_strategy.kill_session(session_id)\r\nif __name__ == \"__main__\":\r\n    asyncio.run(session_example())\r\n<\/code><\/pre>\n<p>More advanced session management includes exporting and importing the browser's storage state (<code>cookies<\/code>, <code>localStorage<\/code>), which lets you stay logged in across separate script runs.<\/p>\n<p><code>Crawl4AI<\/code> provides a powerful and flexible feature set. With sensible configuration, it can efficiently and reliably extract the information you need from all kinds of websites and, for downstream AI 
applications, prepare high-quality data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Traditional web crawling frameworks are versatile, but their output often needs additional cleaning and formatting, which makes integrating them with large language models (LLMs) relatively complex. The output of many tools (such as raw HTML or unstructured JSON) contains a lot of noise and is poorly suited to direct use in scenarios such as retrieval-augmented generation (RAG), because this&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[182],"tags":[],"class_list":["post-30219","post","type-post","status-publish","format-standard","hentry","category-shicao"],"_links":{"self":[{"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/posts\/30219","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/comments?post=30219"}],"version-history":[{"count":0,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/posts\/30219\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/media?parent=30219"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/categories?post=30219"},{"taxonomy":"post_tag","embeddable":true,"href":"https
:\/\/www.kdjingpai.com\/pt\/wp-json\/wp\/v2\/tags?post=30219"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}