Low-Resource Environment Optimization Guide
For GPUs with limited video memory (e.g., 24 GB or less), the following approaches can be used:
- Knowledge slicing: use `split_knowledge.py` to chunk a large knowledge base by topic and load the shards dynamically at runtime (see the first sketch after this list)
- 8-bit quantization: add the `--quantize` parameter to `integrate.py` to cut the model's memory footprint by about 50% (second sketch below)
- CPU offload: set `offload_knowledge=True` to keep inactive knowledge vectors in system memory (third sketch below)
- Batch optimization: set `--batch_size 4` to avoid VRAM overflow
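The article names `split_knowledge.py` but does not show its contents. The following is a minimal sketch of topic-based slicing with on-demand loading; the JSON shard layout and the function names `slice_by_topic` and `load_topic` are illustrative assumptions, not KBLaM's actual implementation.

```python
# Minimal sketch of topic-based knowledge slicing (hypothetical layout;
# KBLaM's actual split_knowledge.py may differ).
import json
from collections import defaultdict
from pathlib import Path

def slice_by_topic(entries, out_dir):
    """Group knowledge entries by their 'topic' field and write one
    JSON shard per topic, so shards can be loaded independently."""
    shards = defaultdict(list)
    for entry in entries:
        shards[entry["topic"]].append(entry)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for topic, items in shards.items():
        (out / f"{topic}.json").write_text(json.dumps(items))
    return sorted(shards)

def load_topic(out_dir, topic):
    """Load a single shard on demand instead of the whole knowledge base."""
    return json.loads((Path(out_dir) / f"{topic}.json").read_text())

entries = [
    {"topic": "biology", "text": "Mitochondria produce ATP."},
    {"topic": "physics", "text": "Light travels at ~3e8 m/s in vacuum."},
]
topics = slice_by_topic(entries, "kb_shards")
print(load_topic("kb_shards", topics[0]))  # only one shard in memory
```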
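The `--quantize` flag belongs to the article's `integrate.py`, whose internals are not shown. As a sketch of the underlying technique, the snippet below loads the base model in 8-bit with Hugging Face `transformers` and `bitsandbytes`; treating this as equivalent to KBLaM's flag is an assumption.

```python
# 8-bit weight loading via bitsandbytes; roughly halves memory vs. fp16.
# Sketch of the technique behind --quantize, not KBLaM's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated on the Hub
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",          # place layers on GPU, spill to CPU if needed
    torch_dtype=torch.float16,  # dtype for the non-quantized modules
)
```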
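`offload_knowledge=True` is the article's configuration name; the article does not show how it works. The sketch below is one plausible PyTorch implementation of the idea: keep every knowledge shard in pinned host memory and copy only the shard currently being queried to the GPU.

```python
# Sketch of CPU offload for knowledge vectors (offload_knowledge=True is
# KBLaM's option; this implementation is illustrative only).
import torch

class OffloadedKnowledge:
    def __init__(self, shards):
        # Keep every shard in pinned host memory for fast async H2D copies.
        self.cpu_shards = {k: v.pin_memory() for k, v in shards.items()}
        self.active_key = None
        self.active_gpu = None

    def get(self, key):
        """Move the requested shard to the GPU, evicting the previous one."""
        if key != self.active_key:
            self.active_gpu = self.cpu_shards[key].to("cuda", non_blocking=True)
            self.active_key = key
        return self.active_gpu

shards = {
    "biology": torch.randn(100_000, 512),
    "physics": torch.randn(100_000, 512),
}
kb = OffloadedKnowledge(shards)
vecs = kb.get("biology")  # only this shard occupies VRAM
```

A small batch size (the article suggests `--batch_size 4`) complements this: it bounds peak activation memory during retrieval and generation, so the GPU budget stays dominated by the one active shard plus the model weights.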
When running Llama-3-8B on an RTX 3090 (24 GB): 1) with slicing, processing one million knowledge entries keeps VRAM usage within 18 GB; 2) after quantization, Q&A latency drops from 320 ms to 210 ms. Alternatively, a small model such as Microsoft's Phi-3-mini can be paired with the knowledge enhancement, trading a performance loss of under 15% for roughly 80% lower VRAM requirements (see the sketch below).
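Phi-3-mini is publicly available on the Hugging Face Hub; below is a sketch of loading it as the lower-VRAM base model. The knowledge-enhancement wiring itself is KBLaM-specific and omitted here.

```python
# Load the much smaller Phi-3-mini as a drop-in, low-VRAM base model.
# Hooking it up to the knowledge base is KBLaM-specific and not shown.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,  # needed on older transformers versions
)
```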
This answer is drawn from the article "KBLaM: An Open Source Enhanced Tool for Embedding External Knowledge in Large Models".