Core approaches to deployment on devices with limited video memory
For devices with 8GB of video memory, Jan-nano offers the following optimizations:
- Use the quantized GGUF version: select the Q4_K_M quantization level, which offers the best balance of performance and resource usage on 8GB devices. Download command via Hugging Face:
  huggingface-cli download bartowski/Menlo_Jan-nano-GGUF --include "Menlo_Jan-nano-Q4_K_M.gguf"
- Adjust inference parameters: limit the maximum number of tokens at startup (e.g. --max-model-len 4096) and turn off non-essential features (such as the tool-call parser, or reduce the number of concurrent requests).
- Adopt a chunking strategy: for long-text tasks, send text fragments in batches through the API, then splice the results together.
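The chunking strategy above can be sketched as follows. Note that the endpoint URL, payload shape (an OpenAI-compatible completions server), and the fragment sizes are illustrative assumptions, not details from the article:

```python
# Sketch of the chunking strategy: split a long input into overlapping
# fragments, send each to a local inference API, and splice the results.
# The URL and request format below are assumptions (OpenAI-compatible
# server), not taken from the article.
import json
import urllib.request


def chunk_text(text, size=2000, overlap=200):
    """Split `text` into fragments of at most `size` characters,
    each sharing `overlap` characters with its predecessor so that
    context is not lost at fragment boundaries."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks


def process_long_text(text, url="http://localhost:8000/v1/completions"):
    """Send each fragment to the API in turn and splice the outputs."""
    parts = []
    for chunk in chunk_text(text):
        payload = json.dumps({
            "model": "Jan-nano",       # assumed model name on the server
            "prompt": chunk,
            "max_tokens": 256,
        }).encode()
        req = urllib.request.Request(
            url, data=payload,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            parts.append(json.load(resp)["choices"][0]["text"])
    return "\n".join(parts)
```

The overlap keeps a little shared context between consecutive fragments, which tends to make the spliced output read more coherently than hard cuts would.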
Alternatives include choosing the lighter Q3_K_XL version (tolerating a performance degradation of about 5%), or running in CPU+RAM mode (which requires installing the library via pip install llama-cpp-python).
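The CPU+RAM fallback can be sketched with llama-cpp-python as below. The model path is illustrative, and the helper function is a hypothetical convenience, not part of the article or the library; `n_ctx` and `n_gpu_layers` are real constructor parameters of `llama_cpp.Llama`:

```python
# Sketch of running the Q4_K_M GGUF purely on CPU+RAM via llama-cpp-python.
# Assumes `pip install llama-cpp-python` has been run and the GGUF file has
# been downloaded; the file path below is an illustrative assumption.

def cpu_mode_kwargs(model_path, n_ctx=4096):
    """Hypothetical helper: build constructor arguments for llama_cpp.Llama
    that keep every layer on the CPU and cap the context window."""
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,        # mirrors the --max-model-len 4096 limit above
        "n_gpu_layers": 0,     # 0 = no layers offloaded to the GPU
    }


if __name__ == "__main__":
    from llama_cpp import Llama  # requires: pip install llama-cpp-python
    llm = Llama(**cpu_mode_kwargs("Menlo_Jan-nano-Q4_K_M.gguf"))
    out = llm("Summarize GGUF quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])
```

CPU+RAM mode trades speed for memory headroom: inference is considerably slower than on a GPU, but the 8GB VRAM limit no longer applies.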
This answer comes from the article "Jan-nano: a lightweight and efficient model for text generation".