Controlling GPU memory during multimodal training
Joint vision-language training of MiniMind-V calls for specific optimization strategies:
- Phased training (a minimal sketch follows this list):
  - First train only the CLIP visual encoder side (freeze_lm=True)
  - Then freeze the vision parameters and train the language head (freeze_vision=True)
  - Finish with joint fine-tuning at a reduced learning rate (learning_rate=1e-5)
- Key techniques (see the second sketch below):
  - Gradient checkpointing (-gradient_checkpointing)
  - FlashAttention-2 in place of standard attention
  - Limit images to at most 224×224 during preprocessing
- Optional measures:
  - Use the LoRA_V variant to train only the vision adapter
  - Progressive training: start at 64×64 resolution, then raise it (a short sketch closes this answer)
  - Distributed training that places the vision and language modules on different GPUs
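A minimal PyTorch sketch of the phased freeze/unfreeze schedule. The ToyVLM stand-in and its vision_encoder/llm submodule names are assumptions for illustration, and the 1e-4 rate in the first two phases is illustrative; MiniMind-V's actual modules and the freeze_lm/freeze_vision switches in its training scripts may be organized differently.

```python
import torch
import torch.nn as nn

# Stand-in model: the real MiniMind-V modules differ, this only illustrates
# the freeze/unfreeze pattern behind freeze_lm / freeze_vision.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 512)  # placeholder for the CLIP side
        self.llm = nn.Linear(512, 512)             # placeholder for the language model

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = ToyVLM()

# Phase 1: train the vision side only (freeze_lm=True)
set_requires_grad(model.llm, False)
set_requires_grad(model.vision_encoder, True)
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

# Phase 2: freeze the vision parameters, train the language head (freeze_vision=True)
set_requires_grad(model.vision_encoder, False)
set_requires_grad(model.llm, True)
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

# Phase 3: unfreeze everything and fine-tune jointly at a reduced learning rate
set_requires_grad(model.vision_encoder, True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```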
Together, these measures reduce the GPU memory footprint of a 32-image batch on an RTX 3090 from 38 GB to 22 GB, making training feasible.
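And a sketch of the memory-saving techniques, assuming PyTorch 2.x: the fused scaled_dot_product_attention kernel stands in for FlashAttention-2 (the repository may instead use the flash_attn package), torch.utils.checkpoint provides gradient checkpointing, and a torchvision Resize caps inputs at 224×224. The helper names here are hypothetical.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint
from torchvision import transforms

# Cap image resolution at 224x224 during preprocessing.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def flash_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # The fused SDPA kernel dispatches to FlashAttention on supported GPUs,
    # avoiding materializing the full attention matrix.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def checkpointed_forward(block: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Gradient checkpointing: activations inside `block` are recomputed during
    # the backward pass instead of stored, trading compute for memory.
    return checkpoint(block, x, use_reentrant=False)
```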
This answer comes from the article "MiniMind: Train a 26M-parameter GPT from scratch in 2 hours (open-source tool)".
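Finally, a sketch of the progressive-resolution option, with illustrative stage sizes and epoch counts rather than MiniMind-V's actual schedule; the GPU-split option simply amounts to placing the vision and language submodules on different devices with .to() and moving activations between them.

```python
from torchvision import transforms

# Progressive resolution: begin at 64x64 and step up toward 224x224.
# (image size, number of epochs) pairs -- the values are illustrative.
resolution_schedule = [(64, 2), (128, 2), (224, 4)]

for size, num_epochs in resolution_schedule:
    stage_preprocess = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
    ])
    for epoch in range(num_epochs):
        # Rebuild the dataloader with `stage_preprocess` and run the usual
        # training loop; attention memory grows roughly with the square of
        # the number of image tokens, so early stages are much cheaper.
        pass
```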