Controlling GPU memory during multimodal training
Joint vision-language training of MiniMind-V calls for specific optimization strategies:
- Phased training (a minimal sketch follows this list):
  - First train only the CLIP visual encoder side (freeze_lm=True)
  - Then freeze the vision parameters and train the language head (freeze_vision=True)
  - Finish with joint fine-tuning at a reduced learning rate (learning_rate=1e-5)
- Key techniques (see the second sketch below):
  - Gradient checkpointing (-gradient_checkpointing)
  - FlashAttention-2 in place of standard attention
  - Limit images to at most 224×224 during preprocessing
- Optional measures:
  - Use the LoRA_V variant to train only the vision adapter
  - Progressive training: start at 64×64 resolution, then raise it (a short sketch closes this answer)
  - Distributed training that places the vision and language modules on different GPUs
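A minimal PyTorch sketch of the phased freeze/unfreeze schedule. The ToyVLM stand-in and its vision_encoder/llm submodule names are assumptions for illustration, and the 1e-4 rate in the first two phases is illustrative; MiniMind-V's actual modules and the freeze_lm/freeze_vision switches in its training scripts may be organized differently.

```python
import torch
import torch.nn as nn

# Stand-in model: the real MiniMind-V modules differ, this only illustrates
# the freeze/unfreeze pattern behind freeze_lm / freeze_vision.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 512)  # placeholder for the CLIP side
        self.llm = nn.Linear(512, 512)             # placeholder for the language model

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = ToyVLM()

# Phase 1: train the vision side only (freeze_lm=True)
set_requires_grad(model.llm, False)
set_requires_grad(model.vision_encoder, True)
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

# Phase 2: freeze the vision parameters, train the language head (freeze_vision=True)
set_requires_grad(model.vision_encoder, False)
set_requires_grad(model.llm, True)
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

# Phase 3: unfreeze everything and fine-tune jointly at a reduced learning rate
set_requires_grad(model.vision_encoder, True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```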
Together, these measures reduce the GPU memory footprint of a 32-image batch on an RTX 3090 from 38 GB to 22 GB, making training feasible.
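And a sketch of the memory-saving techniques, assuming PyTorch 2.x: the fused scaled_dot_product_attention kernel stands in for FlashAttention-2 (the repository may instead use the flash_attn package), torch.utils.checkpoint provides gradient checkpointing, and a torchvision Resize caps inputs at 224×224. The helper names here are hypothetical.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint
from torchvision import transforms

# Cap image resolution at 224x224 during preprocessing.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def flash_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # The fused SDPA kernel dispatches to FlashAttention on supported GPUs,
    # avoiding materializing the full attention matrix.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def checkpointed_forward(block: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Gradient checkpointing: activations inside `block` are recomputed during
    # the backward pass instead of stored, trading compute for memory.
    return checkpoint(block, x, use_reentrant=False)
```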
This answer comes from the article "MiniMind: Train a 26M-parameter GPT from scratch in 2 hours (open-source tool)".
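Finally, a sketch of the progressive-resolution option, with illustrative stage sizes and epoch counts rather than MiniMind-V's actual schedule; the GPU-split option simply amounts to placing the vision and language submodules on different devices with .to() and moving activations between them.

```python
from torchvision import transforms

# Progressive resolution: begin at 64x64 and step up toward 224x224.
# (image size, number of epochs) pairs -- the values are illustrative.
resolution_schedule = [(64, 2), (128, 2), (224, 4)]

for size, num_epochs in resolution_schedule:
    stage_preprocess = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
    ])
    for epoch in range(num_epochs):
        # Rebuild the dataloader with `stage_preprocess` and run the usual
        # training loop; attention memory grows roughly with the square of
        # the number of image tokens, so early stages are much cheaper.
        pass
```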