File Size Control Policy
Output size can be controlled precisely through several groups of parameters:
- Basic limits:
  - Set maxFileSize (in MB) to limit the size of a single output file.
  - Use maxTokens to split output files automatically based on GPT token count.
- Content filtering:
  - Configure selector to extract only the target area (e.g. .main-content).
  - Use filterOutCssSelectors to exclude extraneous elements such as headers and footers.
  - Enable simplifyHtml to remove redundant HTML tags.
- Advanced techniques:
  - Use resourceExclusions: ['*.jpg', '*.mp4'] to exclude media resources.
  - Add a postProcessingHook function to compress the text.
  - Enable splitByDomain for large sites to group output by subdomain.
- Follow-up processing: tools such as jq can be used to split the resulting JSON files manually.
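
As a rough illustration of the options listed above and of the manual splitting step, here is a minimal TypeScript sketch. The `SizeControlOptions` interface is a hypothetical local type that only mirrors option names from the list. It is not imported from GPT-Crawler, and the value types are assumptions, so check them against the configuration schema of the version you have installed. `splitBySize` and `utf8Length` are illustrative helpers, not part of the tool. They show one way to do in a script what the jq-based follow-up would do by hand.

```typescript
// Hypothetical local type: mirrors option names from the list above.
// Value types are assumptions; postProcessingHook is omitted because
// its signature is not given in the text.
interface SizeControlOptions {
  maxFileSize: number; // in MB, per output file
  maxTokens: number;
  selector: string;
  filterOutCssSelectors: string[];
  simplifyHtml: boolean;
  resourceExclusions: string[];
  splitByDomain: boolean;
}

const sizeControl: SizeControlOptions = {
  maxFileSize: 1,
  maxTokens: 2000000,
  selector: ".main-content",
  filterOutCssSelectors: ["header", "footer"],
  simplifyHtml: true,
  resourceExclusions: ["*.jpg", "*.mp4"],
  splitByDomain: true,
};

// Count the UTF-8 bytes of a string without relying on Node or DOM globals.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (
      code >= 0xd800 && code <= 0xdbff &&
      i + 1 < text.length &&
      text.charCodeAt(i + 1) >= 0xdc00 && text.charCodeAt(i + 1) <= 0xdfff
    ) {
      bytes += 4; // surrogate pair encodes one 4-byte code point
      i++;
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// Group records into chunks whose compact JSON serialization stays at or
// below maxBytes. A single record larger than maxBytes gets its own chunk.
function splitBySize<T>(records: T[], maxBytes: number): T[][] {
  const chunks: T[][] = [];
  let current: T[] = [];
  let currentBytes = 2; // the surrounding "[" and "]"
  for (const record of records) {
    const size = utf8Length(JSON.stringify(record));
    const separator = current.length > 0 ? 1 : 0; // the "," between items
    if (current.length > 0 && currentBytes + separator + size > maxBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 2;
    }
    currentBytes += (current.length > 0 ? 1 : 0) + size;
    current.push(record);
  }
  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}

// Usage: split already-crawled records using the configured MB limit.
const pages = [
  { title: "Home", html: "Welcome" },
  { title: "Docs", html: "Getting started" },
];
const parts = splitBySize(pages, sizeControl.maxFileSize * 1024 * 1024);
```

Each element of `parts` can then be written to its own file with `JSON.stringify`. The byte count assumes compact serialization, so pretty-printed output would come out larger than the limit.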
This answer comes from the article "GPT-Crawler: Automatically Crawling Website Content to Generate Knowledge Base Documents".