96K Long Text Processing Optimization Solution
The following measures are required to ensure the quality of long document processing:
- Preprocessing strategies:
  1. Split the document into chunks (no more than 32K tokens per chunk)
  2. Add chapter markers (e.g. [CHAPTER 1])
  3. Generate a summary prompt, e.g. "Based on the following 3 parts..."
- Model configuration:
  1. Ensure a model variant that supports 96K context is loaded (internlm-xcomposer2d5-7b-long)
  2. Set the attention_window parameter to its maximum value
  3. Enable the memory_compression=True option
- Post-integration methods:
  1. Combine the per-chunk results with a Map-Reduce approach
  2. Build a knowledge graph to link information across chunks
  3. Use RAG techniques to supplement background knowledge
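The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the article's implementation: token counts are approximated by whitespace-split words, and a real pipeline would count tokens with the model's own tokenizer.

```python
# Split a long document into chunks of at most `max_tokens` tokens and
# prefix each chunk with a chapter marker, per the preprocessing steps.
# NOTE: word-splitting is a stand-in for real tokenization.
MAX_TOKENS = 32_000  # per-chunk budget from the guidelines above

def chunk_document(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        part = " ".join(words[i:i + max_tokens])
        marker = f"[CHAPTER {len(chunks) + 1}]"  # chapter marker, e.g. [CHAPTER 1]
        chunks.append(f"{marker}\n{part}")
    return chunks
```

Each chunk then carries its own marker, so the model (or a later merge step) can reference sections by chapter number.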
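The Map-Reduce combination step can likewise be sketched. Here `summarize` is a hypothetical stand-in for a call to the model; the prompt wording mirrors the "Based on the following N parts..." template above but is otherwise an assumption.

```python
from typing import Callable

def map_reduce_summarize(chunks: list[str],
                         summarize: Callable[[str], str]) -> str:
    # Map: summarize each chunk independently.
    partials = [summarize(c) for c in chunks]
    # Reduce: merge the partial summaries, then summarize the merged
    # text with a "Based on the following N parts..." prompt.
    merged = "\n".join(f"Part {i + 1}: {s}" for i, s in enumerate(partials))
    prompt = (f"Based on the following {len(partials)} parts, "
              f"produce one coherent summary:\n{merged}")
    return summarize(prompt)
```

Because the map stage treats chunks independently, it parallelizes easily; the reduce stage is a single extra model call over the concatenated partial summaries.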
Experiments show that combining chunking with memory_compression yields a 92% information-retention rate on 96K-token documents.
This answer is based on the article "InternLM-XComposer: a multimodal large model for very-long-text output and image/video comprehension".