Countermeasures for insufficient GPU memory
The following strategies can help when a model does not fit in GPU memory (VRAM):
- Use the low_memory profile: Pass --profile low_memory to enable a preset of memory-saving options, such as FP8 quantization, automatically
- Model quantization: Manually specify a quantization method with --quantization, e.g. awq or squeezellm
- Adjust model sharding: Lower the --tensor-parallel-size value so that it does not exceed the number of GPUs available (set it to 1 for a single GPU)
- Offloading strategy: Set the --swap-space parameter so that system memory (RAM) is used as an extension of GPU memory
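The options above can be combined into a single launch command. The following is a minimal sketch, not the tool's own code: it assumes that `vllm-cli serve <model>` is the launch subcommand and that the flags take the usual double-dash form. The helper name `build_serve_args` and the model ID `my-org/my-model` are hypothetical and used only for illustration.

```python
import shlex
from typing import List, Optional


def build_serve_args(
    model: str,
    profile: Optional[str] = None,
    quantization: Optional[str] = None,
    tensor_parallel_size: Optional[int] = None,
    swap_space_gib: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list; only the options that were set are emitted."""
    # Assumption: `vllm-cli serve <model>` starts a server for the given model.
    args = ["vllm-cli", "serve", model]
    if profile is not None:
        args += ["--profile", profile]
    if quantization is not None:
        args += ["--quantization", quantization]
    if tensor_parallel_size is not None:
        if tensor_parallel_size < 1:
            raise ValueError("tensor_parallel_size must be at least 1")
        args += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if swap_space_gib is not None:
        if swap_space_gib < 0:
            raise ValueError("swap_space_gib must not be negative")
        args += ["--swap-space", str(swap_space_gib)]
    return args


# Hypothetical example: single GPU, low_memory profile, 8 GiB of swap space.
example = build_serve_args(
    "my-org/my-model",
    profile="low_memory",
    tensor_parallel_size=1,
    swap_space_gib=8,
)
print(shlex.join(example))
```

Building the command as a list rather than a string keeps each flag and its value separate, which makes it safe to pass to `subprocess.run` without shell quoting problems.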
Diagnostic steps: When loading fails, first check the specific error in the log viewer provided by vllm-cli. If it is an out-of-memory (OOM) error, run vllm-cli info to check how much GPU memory is available, then either choose a smaller model or enable a more aggressive quantization scheme. For models on the HuggingFace Hub, take care to select the appropriate branch or variant (e.g., a 4-bit quantized version).
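To decide between a smaller model and stronger quantization, a back-of-the-envelope estimate of weight memory is often enough. The sketch below assumes roughly 2 bytes per parameter for FP16, 1 byte for FP8, and 0.5 bytes for 4-bit formats, and it reserves a share of free memory for the KV cache and activations through a `headroom` factor. That factor is an illustrative assumption, not a value taken from vllm-cli, and real usage varies with context length and batch size.

```python
from typing import Optional

# Approximate storage cost of one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params_billion: float, precision: str) -> float:
    """Estimate the memory needed for model weights alone, in GiB."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM[precision] / 2**30


def first_fitting_precision(
    num_params_billion: float,
    free_gib: float,
    headroom: float = 0.8,  # assumed share of free memory usable for weights
) -> Optional[str]:
    """Return the highest precision whose weights fit, or None if none do."""
    budget = free_gib * headroom
    for precision in ("fp16", "fp8", "int4"):
        if weight_memory_gib(num_params_billion, precision) <= budget:
            return precision
    return None


# Hypothetical 7B-parameter model checked against several amounts of free memory.
for free in (24, 12, 6, 2):
    print(free, "GiB free ->", first_fitting_precision(7, free))
```

If the function returns `None`, no quantization level in this table is enough, and a smaller model is the remaining option.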
This answer comes from the article "vLLM CLI: Command Line Tool for Deploying Large Language Models with vLLM".