A Low-Cost Solution for Fast Training of Visual Language Models
For researchers and developers on a tight budget, efficient training is achievable with the MiniMind-V project. Below is a step-by-step solution:
- Hardware Selection: training runs on a single NVIDIA RTX 3090 (24 GB of VRAM); no multi-GPU server is required!
- Cost Control: the total training cost is approximately RMB 1.3, with key advantages including:
  - A lightweight model design with only 26 million parameters
  - The CLIP visual encoder's parameters are frozen; only the projection layer is trained
  - Efficient data preprocessing
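The freeze-the-encoder idea above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's actual code: the tiny `vision_encoder` stands in for the pretrained CLIP vision tower, and the layer sizes are made up.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the CLIP vision encoder; in practice the real
# pretrained CLIP weights would be loaded here instead.
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))

# Projection layer mapping visual features into the language model's space.
projection = nn.Linear(512, 768)

# Freeze the visual encoder: its parameters receive no gradient updates.
for p in vision_encoder.parameters():
    p.requires_grad = False

# The optimizer only sees the trainable projection parameters.
optimizer = torch.optim.AdamW(
    (p for p in projection.parameters() if p.requires_grad), lr=1e-4
)

images = torch.randn(4, 3, 32, 32)
features = vision_encoder(images)   # no gradients flow into the encoder
tokens = projection(features)
loss = tokens.pow(2).mean()         # dummy loss, for illustration only
loss.backward()
optimizer.step()
```

Because only the projection layer's few hundred thousand parameters are updated, the backward pass and optimizer state stay small, which is what makes single-3090 training practical.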
- Time Optimization: one epoch of training completes in about 1 hour, with specific tips:
  - Use the pre-built, cleaned dataset (~5 GB)
  - Keep the default batch size settings, which make good use of the GPU memory
  - Rely on the native PyTorch implementation for runtime efficiency
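The "native PyTorch, sensible batch size" tip amounts to a plain `DataLoader` plus a standard training loop, with no extra frameworks. A minimal sketch with a toy dataset (the tensor sizes and batch size here are illustrative, not the project's actual settings):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the ~5 GB pre-cleaned corpus; batch_size is
# the knob one would tune to fill the 3090's 24 GB of VRAM.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 10, (64,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = torch.nn.Linear(8, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# Plain PyTorch training loop -- one epoch over the loader.
for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```

Keeping the loop this plain avoids framework overhead and makes the per-epoch time predictable, which is how the 1-hour-per-epoch figure stays reliable.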
It is recommended to follow the complete process provided by the project: 4 epochs of pre-training followed by 4 epochs of fine-tuning, keeping the total time within 8 hours. If the results are not good enough, increase the amount of training data rather than the number of parameters.
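The recommended schedule can be expressed as a simple two-stage loop; the stage names and the per-epoch placeholder below are illustrative, not the project's API:

```python
# 4 pre-training epochs followed by 4 fine-tuning epochs;
# at ~1 hour per epoch this stays within the 8-hour budget.
EPOCHS_PER_STAGE = {"pretrain": 4, "finetune": 4}

def run_epoch(stage, epoch):
    # Placeholder for one full pass over the data (~1 hour each).
    return f"{stage}:{epoch}"

log = [run_epoch(stage, e)
       for stage, n in EPOCHS_PER_STAGE.items()
       for e in range(1, n + 1)]
print(len(log))  # 8 epochs total
```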
This answer is based on the article "MiniMind-V: Train a 26M-parameter visual language model in 1 hour".