LMCache solves the problem of repeatedly recomputing the dialog history in multi-round conversations as follows:
- Enable KV caching: set `KVTransferConfig(kv_connector='LMCacheConnector')` at vLLM initialization (a sketch follows this list)
- Configure the storage policy: pick a backend by conversation length (GPU/CPU memory for short conversations, disk or Redis for long ones)
- Adjust cache granularity: set the token chunk size to 256-512 via the `LMCACHE_CHUNK_SIZE` parameter
- Persist with Redis: store historical session KV data durably so the cache survives a server restart
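
As a rough illustration of how these pieces fit together, here is a minimal Python sketch. It is not code from the article: the `kv_role` argument, the `LMCACHE_LOCAL_CPU` and `LMCACHE_REMOTE_URL` environment variables, and the model name are assumptions based on public vLLM/LMCache documentation, and exact names can vary between versions.

```python
# Minimal sketch of wiring LMCache into vLLM; parameter names are
# assumptions from public docs and may differ across versions.
import os

from vllm import LLM
from vllm.config import KVTransferConfig

# LMCache reads its settings from environment variables.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"   # token chunk size (256-512 per the article)
os.environ["LMCACHE_LOCAL_CPU"] = "True"   # assumed: spill KV blocks to CPU RAM
os.environ["LMCACHE_REMOTE_URL"] = "redis://localhost:6379"  # assumed: optional Redis backend

# Route vLLM's KV transfer through the LMCache connector.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed model for illustration
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnector",
        kv_role="kv_both",  # assumed: this process both stores and loads KV
    ),
)
```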
This scheme reuses the intermediate computation results (the KV cache) of the dialog history, significantly reducing GPU prefill work in multi-round Q&A scenarios.
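
Continuing the sketch above (reusing its `llm` object), a hypothetical two-turn exchange shows where the savings come from: the second request repeats the first turn verbatim as its prefix, so those KV blocks can be loaded from the cache instead of recomputed.

```python
from vllm import SamplingParams

params = SamplingParams(temperature=0.7, max_tokens=128)

# Turn 1: the whole prompt is prefilled on the GPU and its KV blocks
# are stored by LMCache.
history = "User: What is LMCache?\nAssistant:"
reply = llm.generate([history], params)[0].outputs[0].text

# Turn 2: the prompt contains turn 1 verbatim as a prefix, so the
# connector can load those KV blocks from the cache; only the newly
# appended tokens need fresh prefill computation.
history += reply + "\nUser: How does that help multi-round chat?\nAssistant:"
reply = llm.generate([history], params)[0].outputs[0].text
```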
This answer comes from the article *LMCache: A KV Cache Optimization Tool for Accelerating Large Language Model Inference*.