LMCache extends traditional KV caching beyond text, enabling it to optimize inference for multimodal models. The system identifies image inputs through content-based hashes (mm_hashes) and caches the key-value pairs derived from both visual and textual tokens in the same storage backend. This reduces the GPU memory footprint of vision-language models (e.g., CLIP, Flamingo) and substantially improves inference speed while preserving output quality. The official LMCache-Examples repository includes concrete multimodal examples demonstrating how the intermediate computation results of image-text pairs are cached and reused.
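The core idea can be illustrated with a simplified, self-contained sketch. This is not LMCache's actual implementation; the class and function names below are hypothetical. It shows content-addressed caching: an image is hashed to a stable key (analogous to vLLM's mm_hashes), and the expensive encoding step is skipped on a cache hit.

```python
import hashlib

class MultimodalKVCache:
    """Toy content-addressed cache illustrating the idea behind
    multimodal KV reuse (hypothetical, not the real LMCache API)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def mm_hash(image_bytes: bytes) -> str:
        # Stable content hash of the raw image, analogous in spirit
        # to the mm_hashes used to identify multimodal inputs.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, compute_kv):
        key = self.mm_hash(image_bytes)
        if key in self._store:
            self.hits += 1          # reuse cached intermediate result
        else:
            self.misses += 1
            # compute_kv stands in for the expensive vision-encoder pass
            self._store[key] = compute_kv(image_bytes)
        return self._store[key]

cache = MultimodalKVCache()
fake_encode = lambda b: [len(b)]    # placeholder for real KV tensors
img = b"\x89PNG fake image data"

cache.get_or_compute(img, fake_encode)  # first call: cache miss, encodes
cache.get_or_compute(img, fake_encode)  # second call: cache hit, reuses
print(cache.hits, cache.misses)         # → 1 1
```

Because the key depends only on image content, the same image appearing in different requests maps to the same cache entry, which is what allows reuse across prompts.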
This answer comes from the article "LMCache: A Key-Value Cache Optimization Tool for Accelerating Reasoning on Large Language Models".