A practical scheme for cross-modal feature alignment
MiniMind-V addresses the core challenge of vision-language feature alignment with the following approach:
- Visual encoding:
  - Visual features are extracted directly with a pre-trained CLIP model (196 image tokens); a minimal extraction sketch follows this list
  - Preserves CLIP's strong cross-modal semantic space
- Projection layer design:
  - A dedicated projection module connects the visual and language modalities
  - Maps image token dimensions into the language model's input space
  - Achieves efficient alignment with a simple linear layer (see the projection sketch after this list)
- Training strategy optimization:
  - The pre-training phase fine-tunes only the projection layer and the final layer of the language model
  - More parameters are gradually unfrozen during the fine-tuning phase
  - Contrastive learning loss is used to strengthen cross-modal understanding (a freezing-schedule sketch follows the practical suggestion below)
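The visual-encoding step can be illustrated with a minimal sketch. It assumes the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch16 checkpoint (224x224 input, 14x14 = 196 patches); the image file name is hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Assumed checkpoint: a ViT-B/16 CLIP encoder (224x224 input -> 14x14 = 196 patch tokens)
model_name = "openai/clip-vit-base-patch16"
processor = CLIPImageProcessor.from_pretrained(model_name)
vision_encoder = CLIPVisionModel.from_pretrained(model_name).eval()

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = vision_encoder(pixel_values=pixel_values)

# last_hidden_state is [1, 197, 768]; dropping the CLS token keeps the 196 patch tokens
patch_tokens = out.last_hidden_state[:, 1:, :]
print(patch_tokens.shape)  # torch.Size([1, 196, 768])
```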
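The projection layer itself can be as small as a single linear map. The sketch below is an assumed shape for such a module, not the project's exact class: 768 is CLIP ViT-B/16's hidden size, and 512 stands in for a small MiniMind-style language model dimension.

```python
import torch
import torch.nn as nn

class VisionProjection(nn.Module):
    """Maps CLIP patch features into the language model's embedding space."""

    def __init__(self, vision_dim: int = 768, lm_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: [batch, 196, vision_dim] -> image embeddings: [batch, 196, lm_dim]
        return self.proj(patch_tokens)

projector = VisionProjection()
image_embeds = projector(torch.randn(1, 196, 768))
print(image_embeds.shape)  # torch.Size([1, 196, 512])
```

The projected image tokens can then be placed into the language model's input sequence alongside the text embeddings.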
Practical suggestion: for custom datasets, freeze the visual encoder and train only the projection layer for 1-2 epochs first, then unfreeze more parameters once the loss has stabilized. The project provides an alignment monitoring script that lets you observe changes in the feature-space distribution through wandb.
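Below is a minimal sketch of that staged freezing schedule, assuming the vision encoder, projector, and language model are plain nn.Module objects; the function name and learning rate are illustrative, not taken from the project's training script.

```python
import torch
import torch.nn as nn

def configure_stage(vision_encoder: nn.Module,
                    projector: nn.Module,
                    language_model: nn.Module,
                    unfreeze_lm: bool = False) -> torch.optim.Optimizer:
    """Stage 1: train only the projection layer; once the loss stabilizes,
    call again with unfreeze_lm=True to also train the language model."""
    for p in vision_encoder.parameters():
        p.requires_grad = False          # visual encoder stays frozen throughout
    for p in language_model.parameters():
        p.requires_grad = unfreeze_lm    # unfrozen only in the later stage
    for p in projector.parameters():
        p.requires_grad = True           # the projection layer always trains

    trainable = [p for module in (projector, language_model)
                 for p in module.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)  # illustrative learning rate
```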
This answer comes from the article "MiniMind-V: 1 hour training of a 26M parameter visual language model".