A three-step optimization plan
On an average computer with 4-8 GB of RAM, performance can be significantly improved by:
- Model selection: Prefer small models at the Q4_K_M quantization level (under 1 GB), such as the gemma-3-1b-it recommended in the article; it is roughly 75% smaller than the original FP16 model while retaining more than 90% of its quality.
- System optimization:
  - Close other memory-hungry programs (e.g. browsers) and make sure at least 2 GB of memory is free
  - While the program is running, right-click the EXE file → Properties → check "Run as administrator" (not required, but it can raise the priority of resource allocation)
- Usage tips:
  - After the first model load, avoid switching models frequently so the model stays in memory
  - Placing the model file on a USB 3.0 high-speed flash drive cuts loading time by about 10%
  - Split complex tasks into multiple short conversations (no more than 200 words per question)
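The last tip, splitting a long task into short questions, can be sketched with a small helper. The `split_prompt` function below is a hypothetical illustration, not part of the tool itself:

```python
def split_prompt(text: str, max_words: int = 200) -> list[str]:
    """Split a long prompt into chunks of at most max_words words each."""
    words = text.split()
    chunks, current = [], []
    for word in words:
        current.append(word)
        if len(current) >= max_words:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk can then be sent as its own short conversation turn.
long_task = "please summarise this section " * 90  # ~450 words
for i, part in enumerate(split_prompt(long_task), 1):
    print(f"Question {i}: {len(part.split())} words")
```

Keeping each turn under the limit avoids long prompt-processing stalls on a CPU-only machine.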
In testing, these optimizations raised generation speed from 8 tokens/s to a usable 18-22 tokens/s on an entry-level i5-8250U/8GB laptop. If that still falls short, try the more aggressive Q2_K quantization (accuracy drops, but the size is roughly halved again).
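The size figures above follow from simple bits-per-weight arithmetic. As a rough sketch (the bits-per-weight values below, about 4.5 for Q4_K_M and about 2.6 for Q2_K, are common estimates and not stated in the article):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk model size: parameters x bits per weight, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

n = 1e9  # a ~1B-parameter model such as gemma-3-1b-it
fp16 = model_size_gb(n, 16.0)  # 2.00 GB
q4 = model_size_gb(n, 4.5)     # ~0.56 GB, well under the 1 GB budget
q2 = model_size_gb(n, 2.6)     # Q2_K, roughly half of Q4_K_M again
print(f"FP16: {fp16:.2f} GB, Q4_K_M: {q4:.2f} GB, Q2_K: {q2:.2f} GB")
print(f"Q4_K_M saves {(1 - q4 / fp16):.0%} versus FP16")
```

The computed saving (about 72%) is close to the article's ~75% figure; exact sizes vary by quantization recipe and tokenizer overhead.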
This answer is based on the article "Local LLM Notepad: A Portable Tool for Running Local Large Language Models Offline".