Advanced Content Extraction Functionality Explained
Functional value
This feature extracts plain text content and related image resources directly from specified web pages, addressing the following pain points:
- Bypassing website anti-crawler mechanisms to obtain key information
- Keeping formatting consistent when batch processing multiple pages
- Avoiding manual cleanup of distracting elements such as ads and navigation bars
Specific implementation methods
A typical use of the extract() method:
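from tavily import TavilyClient

# Replace the placeholder with your own API key
client = TavilyClient(api_key="tvly-YOUR_API_KEY")

urls = ["https://example.com/page1", "https://example.com/page2"]
response = client.extract(
    urls=urls,
    include_images=True,   # whether to extract images
    max_text_length=5000   # limits the length of the extracted text
)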
Return data structure
- raw_content: Plain text with HTML tags removed
- images: List of image URLs (when include_images=True)
- metadata: Contains meta information such as article source, crawl time, etc.
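The fields listed above can be read with a small helper. This is a minimal sketch, not the library's own code: the helper name summarize_results is hypothetical, and it assumes that the per-page entries sit in a top-level "results" list with a "url" key alongside the raw_content, images, and metadata fields described here. The sample dictionary is hand-written for illustration, not real output.

```python
from typing import Any


def summarize_results(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten an extract-style response into one summary per page."""
    summaries = []
    # Assumption: per-page entries live under a top-level "results" list.
    for item in response.get("results", []):
        text = item.get("raw_content") or ""
        summaries.append({
            "url": item.get("url"),
            "chars": len(text),                            # length of the plain text
            "preview": text[:80],                          # first 80 characters
            "image_count": len(item.get("images") or []),  # missing when images are off
            "metadata": item.get("metadata") or {},
        })
    return summaries


# Hand-written sample standing in for a real response (illustrative only)
sample = {
    "results": [
        {
            "url": "https://example.com/page1",
            "raw_content": "Hello world",
            "images": ["https://example.com/a.png"],
            "metadata": {"source": "example.com"},
        }
    ]
}

for row in summarize_results(sample):
    print(row["url"], row["chars"], row["image_count"])
```

Using .get() with fallbacks keeps the helper from failing when images or metadata are absent, for example when include_images is False.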
Note: A single call supports up to 20 URLs; the commercial version raises this limit to 100.
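When a URL list is longer than the per-call limit, one way to stay within it is to split the list into batches before calling extract(). The following is a sketch under that assumption; the helper name chunked is hypothetical, and the default batch size of 20 simply mirrors the limit stated above.

```python
from typing import Iterator


def chunked(urls: list[str], size: int = 20) -> Iterator[list[str]]:
    """Yield consecutive slices of urls, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


urls = [f"https://example.com/page{i}" for i in range(1, 46)]
batches = list(chunked(urls))
print([len(b) for b in batches])  # [20, 20, 5]

# Each batch could then be sent in its own call, for example:
# for batch in chunked(urls):
#     response = client.extract(urls=batch, include_images=True)
```

With the commercial limit, the same helper can be called with size=100.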
This answer comes from the article "Tavily: Real-Time Information Search API Service for AI".