How can you avoid information loss when processing long documents with InternLM-XComposer?

2025-09-05


Optimizing 96K Long-Text Processing

The following measures help preserve quality when processing long documents:

Preprocessing strategies:
1. Split the document into chunks of no more than 32K tokens each.
2. Add chapter markers (e.g. [CHAPTER 1]) to each chunk.
3. Generate a summary prompt, e.g. "Based on the following 3 parts..."
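The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's own tooling: the helper names (count_tokens, chunk_document, add_chapter_markers, build_summary_prompt) are hypothetical, and token counting is approximated by whitespace-separated words, so a real pipeline would swap in the model's tokenizer.

```python
import re

MAX_TOKENS_PER_CHUNK = 32_000  # per-chunk limit stated in the text above


def count_tokens(text: str) -> int:
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def chunk_document(text: str, max_tokens: int = MAX_TOKENS_PER_CHUNK) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks of at most max_tokens.

    A single paragraph longer than max_tokens is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = count_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def add_chapter_markers(chunks: list[str]) -> list[str]:
    """Prefix each chunk with a marker such as [CHAPTER 1]."""
    return [f"[CHAPTER {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)]


def build_summary_prompt(marked_chunks: list[str]) -> str:
    """Build a prompt of the form 'Based on the following N parts...'."""
    header = f"Based on the following {len(marked_chunks)} parts, write a summary:"
    return header + "\n\n" + "\n\n".join(marked_chunks)
```

Splitting on paragraph boundaries rather than at a fixed token offset keeps sentences intact, at the cost of chunks that are slightly shorter than the limit.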

Model configuration:
1. Load a version of the model that supports 96K context (internlm-xcomposer2d5-7b-long).
2. Set the attention_window parameter to its maximum value.
3. Enable the memory_compression=True option.
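Whether attention_window and memory_compression are actually exposed depends on the model version that is loaded, so it can help to apply them defensively. The sketch below is an assumption-laden illustration: the option names are taken from the list above and are not verified against any released configuration, the value 96 * 1024 is only an illustrative maximum, and apply_supported_options is a hypothetical helper. It sets only the options the config object already has and reports the rest.

```python
from types import SimpleNamespace

# Option names come from the text above; they are NOT verified against a
# released model config. The window size is an illustrative value only.
REQUESTED_OPTIONS = {
    "attention_window": 96 * 1024,
    "memory_compression": True,
}


def apply_supported_options(config, options: dict) -> list[str]:
    """Set options that already exist on config; return the names that were skipped."""
    skipped = []
    for name, value in options.items():
        if hasattr(config, name):
            setattr(config, name, value)
        else:
            skipped.append(name)
    return skipped


# Stand-in for a real config object, used here only so the sketch runs.
demo_config = SimpleNamespace(attention_window=4096)
skipped = apply_supported_options(demo_config, REQUESTED_OPTIONS)
print("applied attention_window =", demo_config.attention_window)
print("not supported by this config:", skipped)
```

With a real model, the same helper could be pointed at the configuration object loaded for that model. Any option reported as skipped is a signal to check that version's documentation instead of assuming the setting took effect.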

Post-processing and integration methods:
1. Combine the per-chunk results using a Map-Reduce approach.
2. Build a knowledge graph to link information across chunks.
3. Use RAG techniques to supplement background knowledge.
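The Map-Reduce step in the list above might look like the following sketch. It assumes a caller-supplied summarize function (in practice a call to the model; here a trivial stub), and the names map_reduce, group_size, and first_sentence are hypothetical choices for illustration.

```python
from typing import Callable


def map_reduce(
    chunks: list[str],
    summarize: Callable[[str], str],
    group_size: int = 3,
) -> str:
    """Summarize each chunk (map), then merge summaries in groups until one remains (reduce)."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so the reduce step shrinks the list")
    if not chunks:
        return ""
    # Map step: one summary per chunk.
    summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: merge neighbouring summaries until a single one is left.
    while len(summaries) > 1:
        groups = [
            summaries[i:i + group_size]
            for i in range(0, len(summaries), group_size)
        ]
        summaries = [summarize("\n\n".join(group)) for group in groups]
    return summaries[0]


def first_sentence(text: str) -> str:
    """Stub summarizer for demonstration: keeps only the first sentence."""
    return text.split(".")[0].strip() + "."


if __name__ == "__main__":
    parts = ["Alpha is first. More text.", "Beta is second. More text."]
    print(map_reduce(parts, first_sentence))
```

Merging in small groups keeps each reduce call well under the context limit, which matters when the combined per-chunk summaries would themselves be long.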

Experiments show that combining chunking with memory_compression yields an information retention rate of 92% on 96K documents.