LangExtract handles long documents through the following mechanisms (a short usage sketch follows the list):
- Intelligent chunking: automatically splits a long document into appropriately sized text blocks
- Parallel processing: the `max_workers` parameter controls the number of worker threads (e.g., 4 workers when processing the full text of Romeo and Juliet)
- Multi-round extraction: the `num_passes` parameter runs the extraction multiple times to improve accuracy
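A minimal sketch of how these pieces might fit together in a single call, assuming the library's `lx.extract` entry point. The prompt, few-shot example, file path, and worker/pass counts are illustrative, and the parameter names `max_workers` and `num_passes` follow this article, so verify them against the library's current API.

```python
import langextract as lx

# Illustrative prompt and few-shot example; replace with your own extraction schema.
prompt = "Extract characters and the relationships between them."
examples = [
    lx.data.ExampleData(
        text="Romeo loves Juliet.",
        extractions=[
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Romeo loves Juliet",
                attributes={"source": "Romeo", "target": "Juliet"},
            )
        ],
    )
]

# Hypothetical input file: the full text of Romeo and Juliet.
with open("romeo_and_juliet.txt", encoding="utf-8") as f:
    full_text = f.read()

result = lx.extract(
    text_or_documents=full_text,   # the long document; chunking happens automatically
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    max_workers=4,                 # parallel workers over the chunks
    num_passes=2,                  # extra extraction rounds (parameter name as given in this article)
)
```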
Optimization Recommendations:
- Tier 2 Gemini quotas are recommended to avoid rate limiting when processing very long documents
- For complex documents, consider switching to a more powerful model (e.g., from `gemini-2.5-flash` to `gemini-2.5-pro`)
- Ensure a stable network connection, especially when using cloud-based models
- Results can be saved to a JSONL file with the `save_annotated_documents` method (see the sketch after this list)
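A hedged sketch of that last point, assuming the `lx.io.save_annotated_documents` helper; the output file name and directory are placeholders.

```python
import langextract as lx

# Persist the annotated result(s) returned by lx.extract as a JSONL file.
# File name and directory below are illustrative.
lx.io.save_annotated_documents(
    [result],                                  # one or more annotated documents
    output_name="extraction_results.jsonl",
    output_dir=".",
)
```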
This answer is based on the article "LangExtract: open source tools to extract structured data from text".