MiniMind-V is an open-source, low-cost vision-language model (VLM) training framework hosted on GitHub. It significantly lowers the barrier to multimodal AI by combining a lightweight 26-million-parameter architecture with an efficient training scheme that lets developers complete model training in under an hour.
Core features include:
- Vision-language co-processing: extends the MiniMind language model with vision capabilities by adding a CLIP vision encoder and a feature projection module (see the sketch after this list)
- Full-pipeline support: provides complete code from data cleaning and pre-training through supervised fine-tuning, customizable with as few as 50 lines of changes
- Low-cost training: runs on a single NVIDIA RTX 3090 GPU, with a pre-training cost of about RMB 1.3
- Multimodal interaction: supports single- and multi-image inputs for tasks such as image description and visual question answering
- Deployment friendly: offers two inference methods, a web interface and a command line, and is compatible with the Hugging Face and ModelScope ecosystems
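
The feature projection module mentioned in the first bullet can be pictured as a simple learned mapping from CLIP patch embeddings into the language model's hidden space. The sketch below is illustrative only; the class name, layer choice, and dimensions (CLIP dim 768, LM dim 512) are assumptions, not MiniMind-V's actual configuration.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative feature projection: maps CLIP patch embeddings
    into the language model's hidden space. Dimensions are assumed,
    not taken from MiniMind-V's actual configuration."""
    def __init__(self, clip_dim: int = 768, lm_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(clip_dim, lm_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: [batch, num_patches, clip_dim] from a frozen CLIP vision encoder
        return self.proj(patch_feats)  # -> [batch, num_patches, lm_dim]

# Example: 196 patches from a 224x224 image with a ViT-B/16-style backbone (assumed)
projector = VisionProjector()
dummy_patches = torch.randn(1, 196, 768)
image_tokens = projector(dummy_patches)
print(image_tokens.shape)  # torch.Size([1, 196, 512])
```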
This project is especially suitable for developers who need to quickly prototype multimodal applications, and its design philosophy emphasizes "code minimalism". Its main technical contribution lies in the feature projection layer used to align visual and linguistic features, as illustrated below.
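
One common way such alignment is realized is to splice the projected image tokens into the text embedding sequence at reserved placeholder positions before the language model's forward pass. The snippet below is a minimal sketch of that idea under assumed shapes and placeholder handling; it is not MiniMind-V's actual implementation.

```python
import torch

def splice_image_tokens(text_embeds: torch.Tensor,
                        image_tokens: torch.Tensor,
                        image_positions: torch.Tensor) -> torch.Tensor:
    """Overwrite placeholder positions in the text embedding sequence with
    projected image tokens; the fused sequence is then fed to the language
    model as usual. The placeholder mechanism here is an assumption.

    text_embeds:     [batch, seq_len, lm_dim]
    image_tokens:    [batch, num_patches, lm_dim] (output of the projector)
    image_positions: [batch, num_patches] indices of the placeholder tokens
    """
    fused = text_embeds.clone()
    batch_idx = torch.arange(text_embeds.size(0)).unsqueeze(-1)
    fused[batch_idx, image_positions] = image_tokens
    return fused
```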
This answer comes from the article "MiniMind-V: 1-Hour Training of a 26M-Parameter Vision-Language Model".