VideoRAG's multimodal retrieval system uses a feature-level fusion strategy. The workflow can be divided into four stages:
- Cross-modal feature extraction (Sketch 1 below):
  - Visual channel: CLIP-style features extracted from keyframes using ImageBind
  - Text channel: embedding vectors of the ASR transcript produced by Distil-Whisper
- Hierarchical index construction (Sketch 2 below):
  - Video-level coarse-grained index (HNSW graph structure)
  - Segment-level fine-grained index (Faiss IVF vector store)
- Query routing mechanism (Sketch 3 below):
  - Plain-text queries: knowledge-graph node retrieval takes priority
  - Vision-related queries: cross-modal similarity computation is activated
- Hybrid ranking output: results are ranked by combining three dimensions, semantic relevance, temporal proximity, and cross-modal consistency (Sketch 4 below)
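
Sketch 1 is a minimal illustration of the feature-extraction stage. It assumes the public ImageBind repo API (`imagebind_huge`, `load_and_transform_vision_data`) and the `distil-whisper/distil-large-v2` checkpoint via Hugging Face `transformers`; routing the transcript through ImageBind's text encoder so both channels share one embedding space is an illustrative choice, not a confirmed VideoRAG detail.

```python
# Sketch 1: cross-modal feature extraction (illustrative, not VideoRAG's code).
import torch
from transformers import pipeline
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"

# Visual channel: one embedding per extracted keyframe.
encoder = imagebind_model.imagebind_huge(pretrained=True).to(device).eval()

def embed_keyframes(keyframe_paths: list[str]) -> torch.Tensor:
    inputs = {ModalityType.VISION: data.load_and_transform_vision_data(keyframe_paths, device)}
    with torch.no_grad():
        return encoder(inputs)[ModalityType.VISION]  # shape: (n_frames, 1024)

# Text channel: Distil-Whisper transcribes the audio track, then the
# transcript is embedded so it can be compared against the visual vectors.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    device=0 if torch.cuda.is_available() else -1,
)

def embed_transcript(audio_path: str) -> tuple[str, torch.Tensor]:
    text = asr(audio_path)["text"]
    inputs = {ModalityType.TEXT: data.load_and_transform_text([text], device)}
    with torch.no_grad():
        return text, encoder(inputs)[ModalityType.TEXT]
```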
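Sketch 2 shows how the two-level index could be wired up with Faiss, which provides both index types named above. The dimensionality, the 32-neighbor HNSW setting, `nlist=256`, and `nprobe=16` are illustrative values, not parameters taken from the paper.

```python
# Sketch 2: hierarchical index construction (parameters are illustrative).
import numpy as np
import faiss

DIM = 1024  # ImageBind embedding size

# Coarse index: HNSW graph, one pooled vector per video. L2 distance on
# normalized vectors ranks identically to cosine similarity.
video_index = faiss.IndexHNSWFlat(DIM, 32)
video_index.hnsw.efSearch = 64

# Fine index: IVF over per-segment vectors with an inner-product quantizer.
quantizer = faiss.IndexFlatIP(DIM)
segment_index = faiss.IndexIVFFlat(quantizer, DIM, 256, faiss.METRIC_INNER_PRODUCT)

def build(video_vecs: np.ndarray, segment_vecs: np.ndarray) -> None:
    video_vecs = np.ascontiguousarray(video_vecs, dtype="float32")
    segment_vecs = np.ascontiguousarray(segment_vecs, dtype="float32")
    faiss.normalize_L2(video_vecs)
    faiss.normalize_L2(segment_vecs)
    video_index.add(video_vecs)
    segment_index.train(segment_vecs)  # IVF must learn its centroids first
    segment_index.add(segment_vecs)

def search(query_vec: np.ndarray, k: int = 5):
    q = np.ascontiguousarray(query_vec.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    _, video_ids = video_index.search(q, k)   # coarse pass: candidate videos
    segment_index.nprobe = 16                 # fine pass: probe 16 IVF lists
    scores, segment_ids = segment_index.search(q, k)
    return video_ids[0], segment_ids[0], scores[0]
```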
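Sketch 3 is a deliberately simple router. The keyword heuristic and the `kg_store.match_nodes` helper are hypothetical stand-ins, since the answer does not describe how VideoRAG classifies queries or queries its knowledge graph.

```python
# Sketch 3: query routing (heuristic; the KG interface is hypothetical).
VISUAL_CUES = {"show", "shown", "look", "scene", "color", "appear", "wearing", "screen"}

def route_query(query: str) -> str:
    """Return 'visual' for vision-related queries, 'text' otherwise."""
    return "visual" if set(query.lower().split()) & VISUAL_CUES else "text"

def retrieve(query, kg_store, segment_index, embed_fn, k=5):
    if route_query(query) == "text":
        # Plain-text query: knowledge-graph node retrieval takes priority.
        return kg_store.match_nodes(query, k)  # hypothetical KG interface
    # Vision-related query: activate cross-modal similarity search.
    q = embed_fn(query).reshape(1, -1).astype("float32")
    scores, ids = segment_index.search(q, k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```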
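Sketch 4 combines the three ranking dimensions as a weighted sum. The weights are placeholders; the answer does not give VideoRAG's actual fusion formula.

```python
# Sketch 4: hybrid ranking over the three dimensions (weights are placeholders).
from dataclasses import dataclass

@dataclass
class Candidate:
    segment_id: int
    semantic: float     # query-segment embedding similarity
    temporal: float     # proximity to other hits in the timeline, 0..1
    consistency: float  # visual/transcript channel agreement, 0..1

def hybrid_rank(candidates, w_sem=0.5, w_time=0.2, w_cons=0.3, k=5):
    """Sort candidates by a weighted sum of the three ranking dimensions."""
    def score(c: Candidate) -> float:
        return w_sem * c.semantic + w_time * c.temporal + w_cons * c.consistency
    return sorted(candidates, key=score, reverse=True)[:k]
```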
This mechanism achieved a top-5 retrieval accuracy of 81.3% on the LongerVideos benchmark, significantly outperforming unimodal baseline approaches.
This answer comes from the article "VideoRAG: A RAG framework for understanding ultra-long videos with support for multimodal retrieval and knowledge graph construction".