MiniMind-V is an open-source project hosted on GitHub that lets users train a lightweight vision-language model (VLM) with only 26 million parameters in under an hour. It builds on the MiniMind language model, adding a visual encoder and a feature projection module so that images and text can be processed jointly. The project provides complete code from dataset cleaning to model inference, with a training cost as low as about 1.3 RMB on a single GPU (e.g., an NVIDIA 3090). MiniMind-V emphasizes simplicity and ease of use: the changes on top of MiniMind amount to fewer than 50 lines of code, making it a suitable tool for developers to experiment with and learn how vision-language models are built.
Function List
- Provides complete training code for a 26-million-parameter vision-language model, supporting fast training on a single GPU.
- Uses the CLIP visual encoder to turn a 224×224 pixel image into 196 visual tokens (see the sketch after this list).
- Supports single- and multi-image input combined with text, for dialogue, image description, or Q&A.
- Includes full-process scripts for dataset cleaning, pre-training, and supervised fine-tuning (SFT).
- Provides a native PyTorch implementation with multi-GPU acceleration and broad compatibility.
- Offers downloadable model weights on both the Hugging Face and ModelScope platforms.
- Provides a web interface and command-line inference for quick testing of model quality.
- Supports wandb for logging loss and performance during training.
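The 196-token figure follows from the CLIP ViT-B/16 patch size: a 224×224 image split into 16×16 patches yields (224/16)² = 196 patches. The sketch below illustrates this and shows how such patch features could be projected into the language model's embedding space. It is only an illustration, not the project's actual code: the Linear projection is hypothetical, and it assumes the CLIP weights have already been downloaded to ./model/vision_model as described in the Environment Configuration section below.

```python
# A rough sketch (not the project's code) of why a 224x224 image becomes 196
# visual tokens with clip-vit-base-patch16, and how such patch features could
# be projected into the language model's embedding space. Assumes the CLIP
# weights are available in ./model/vision_model.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

vision = CLIPVisionModel.from_pretrained("./model/vision_model")
processor = CLIPImageProcessor.from_pretrained("./model/vision_model")

image = Image.new("RGB", (224, 224))  # dummy image, enough for a shape check
pixels = processor(images=image, return_tensors="pt").pixel_values  # (1, 3, 224, 224)

with torch.no_grad():
    out = vision(pixel_values=pixels)

patch_tokens = out.last_hidden_state[:, 1:, :]  # drop the [CLS] token
print(patch_tokens.shape)  # torch.Size([1, 196, 768]); (224 / 16)^2 = 196 patches

# Illustrative projection into a dim=512 embedding space; the real project
# defines its own projection module.
projection = torch.nn.Linear(patch_tokens.shape[-1], 512)
visual_tokens = projection(patch_tokens)  # (1, 196, 512)
```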
Using Help
The workflow for using MiniMind-V covers environment configuration, data preparation, model training, and testing. Each step is described in detail below to help users get started quickly.
Environment Configuration
MiniMind-V requires a Python environment and GPU support. Here are the installation steps:
- Clone the code
Run the following commands in a terminal to download the project code:
git clone https://github.com/jingyaogong/minimind-v
cd minimind-v
- Install dependencies
The project provides a requirements.txt file listing the required libraries. Run:
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Python 3.9 or above is recommended. Make sure PyTorch was installed with CUDA support (if you have a GPU), which can be verified with:
import torch
print(torch.cuda.is_available())
If the output is True, the GPU is available.
- Download the CLIP model
MiniMind-V uses the CLIP model (clip-vit-base-patch16) as its visual encoder. Run the following command to download it into ./model/vision_model:
git clone https://huggingface.co/openai/clip-vit-base-patch16 ./model/vision_model
It can also be downloaded from ModelScope:
git clone https://www.modelscope.cn/models/openai-mirror/clip-vit-base-patch16 ./model/vision_model
- Download the base language model weights
MiniMind-V builds on the MiniMind language model, so the language model weights must be downloaded into the ./out directory. Example:
wget https://huggingface.co/jingyaogong/MiniMind2-V-PyTorch/blob/main/lm_512.pth -P ./out
Alternatively, download lm_768.pth, depending on the model configuration. A quick way to sanity-check the downloads is shown after this list.
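After the downloads, a short script such as the one below can verify that both files load. This is only a sketch: it assumes the checkpoint is a plain PyTorch state_dict saved with torch.save; if the file has a different structure, adjust accordingly.

```python
# Sanity-check sketch (not part of the project): load the CLIP vision encoder
# from ./model/vision_model and inspect the language-model checkpoint,
# assumed here to be a plain PyTorch state_dict.
import torch
from transformers import CLIPVisionModel

vision = CLIPVisionModel.from_pretrained("./model/vision_model")
print("CLIP hidden size:", vision.config.hidden_size)  # 768 for ViT-B/16

state_dict = torch.load("./out/lm_512.pth", map_location="cpu")
print("Tensors in lm_512.pth:", len(state_dict))
print("First keys:", list(state_dict)[:5])
```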
Data Preparation
MiniMind-V uses roughly 570,000 pre-training samples and 300,000 instruction fine-tuning samples, taking up about 5 GB of storage. The procedure is as follows:
- Create the dataset directory
In the project root directory, create the ./dataset folder:
mkdir dataset
- Download the dataset
Download the dataset, which contains *.jsonl Q&A data and *images image archives, from Hugging Face or ModelScope:
  - Hugging Face: https://huggingface.co/datasets/jingyaogong/minimind-v_dataset
  - ModelScope: https://www.modelscope.cn/datasets/gongjy/minimind-v_dataset
Then unzip the image data into ./dataset:
unzip pretrain_images.zip -d ./dataset
unzip sft_images.zip -d ./dataset
- Validate the dataset
Make sure ./dataset contains the following files (a quick inspection sketch follows this list):
  - pretrain_vlm_data.jsonl: pre-training data, approximately 570,000 entries.
  - sft_vlm_data.jsonl: single-image fine-tuning data, approximately 300,000 entries.
  - sft_vlm_data_multi.jsonl: multi-image fine-tuning data, approximately 13,600 entries.
  - Image folders: contain the image files for pre-training and fine-tuning.
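To confirm the *.jsonl files are in place without assuming anything about their internal schema, a short inspection script like the following can be used; it only counts lines and prints whatever keys the first record actually contains.

```python
# Dataset inspection sketch: counts records and prints the keys of the first
# record in each *.jsonl file, without assuming a particular schema.
import json

files = [
    "./dataset/pretrain_vlm_data.jsonl",
    "./dataset/sft_vlm_data.jsonl",
    "./dataset/sft_vlm_data_multi.jsonl",
]
for path in files:
    with open(path, "r", encoding="utf-8") as f:
        first = json.loads(f.readline())
        count = 1 + sum(1 for _ in f)  # first line plus the rest
    print(f"{path}: {count} records, first record keys: {list(first)}")
```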
Model Training
MiniMind-V training is divided into pre-training and supervised fine-tuning (SFT), and supports single- or multi-GPU acceleration.
- Configure the parameters
Edit ./model/LMConfig.py to set the model parameters. Examples:
  - Small model: dim=512, n_layers=8
  - Medium model: dim=768, n_layers=16
These parameters determine the model size and performance (see the configuration sketch after this list).
- Pre-training
Run the pre-training script to learn image description capabilities:
python train_pretrain_vlm.py --epochs 4
The output weights are saved as ./out/pretrain_vlm_512.pth (or 768.pth, depending on the configuration). A single NVIDIA 3090 takes about 1 hour per epoch. This stage freezes the CLIP model and trains only the projection layer and the last layer of the language model (see the freezing sketch after this list).
- Supervised fine-tuning (SFT)
Fine-tune from the pre-trained weights to improve conversational ability:
python train_sft_vlm.py --epochs 4
The output weights are saved as ./out/sft_vlm_512.pth. This step trains the projection layer and all parameters of the language model.
- Multi-GPU training (optional)
If you have N GPUs, use the following command to accelerate training:
torchrun --nproc_per_node N train_pretrain_vlm.py --epochs 4
Replace train_pretrain_vlm.py with the other training scripts (e.g., train_sft_vlm.py) as needed.
- Monitor training
Training losses can be recorded with wandb:
python train_pretrain_vlm.py --epochs 4 --use_wandb
View the metrics in real time on the wandb website.
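For reference, the configuration sketch below mirrors the dim/n_layers examples from the configuration step as a hypothetical dataclass. The real configuration lives in ./model/LMConfig.py, and its field names and defaults may differ.

```python
# Illustrative only: a hypothetical configuration mirroring the dim/n_layers
# examples above. The real fields are defined in ./model/LMConfig.py and may
# use different names and defaults.
from dataclasses import dataclass

@dataclass
class VLMConfigSketch:
    dim: int = 512           # hidden size (512 for the small model, 768 for the medium one)
    n_layers: int = 8        # transformer layers (8 small, 16 medium)
    image_tokens: int = 196  # visual tokens per image from CLIP ViT-B/16

small = VLMConfigSketch()                      # dim=512, n_layers=8
medium = VLMConfigSketch(dim=768, n_layers=16)
print(small)
print(medium)
```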
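The freezing sketch below shows, in plain PyTorch, the parameter-freezing pattern described in the pre-training step: freeze the CLIP encoder, keep the projection layer trainable, and unfreeze only the last layer of the language model. The module names vision_encoder, projection, and language_model are hypothetical, not the project's actual attributes.

```python
# A sketch of the freezing pattern used during pre-training, not the project's
# training code. vision_encoder, projection, and language_model are
# hypothetical module names.
import torch.nn as nn

def freeze_for_pretraining(vision_encoder: nn.Module,
                           projection: nn.Module,
                           language_model: nn.Module) -> list:
    """Return the parameters left trainable for VLM pre-training."""
    # Freeze the CLIP visual encoder entirely.
    for p in vision_encoder.parameters():
        p.requires_grad = False

    # Freeze the language model, then unfreeze its last child module as a
    # stand-in for "the last layer of the language model".
    for p in language_model.parameters():
        p.requires_grad = False
    for p in list(language_model.children())[-1].parameters():
        p.requires_grad = True

    # The projection layer stays fully trainable.
    trainable = [p for module in (projection, language_model)
                 for p in module.parameters() if p.requires_grad]
    return trainable
```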
Effectiveness Test
Once training is complete, the model can be tested for image dialog capabilities.
- Command-line inference
Run the following command to load the model:
python eval_vlm.py --load 1 --model_mode 1
  - --load 1: load the transformers-format model from Hugging Face; --load 0: load PyTorch weights from ./out.
  - --model_mode 1: test the fine-tuned model; --model_mode 0: test the pre-trained model.
- Web interface testing
Launch the web interface:
python web_demo_vlm.py
Visit http://localhost:8000, upload an image, and enter text to test.
- Input format
MiniMind-V represents an image with a placeholder made up of 196 '@' characters. Example:
@@@...@@@\n这张图片是什么内容? ("What is in this image?")
Example of multi-image input (a small helper sketch follows this list):
@@@...@@@\n第一张图是什么?\n@@@...@@@\n第二张图是什么? ("What is the first image? What is the second image?")
- Download pre-trained weights
If you do not want to train the model yourself, you can download the official weights directly:
  - PyTorch format: https://huggingface.co/jingyaogong/MiniMind2-V-PyTorch
  - Transformers format: https://huggingface.co/collections/jingyaogong/minimind-v-67000833fb60b3a2e1f3597d
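For the input format described above, a prompt is simply the 196-character '@' placeholder followed by the question, repeated once per image. The helper below is a small illustration of that convention, not part of the project's code; the evaluation and web scripts build these prompts internally.

```python
# Illustration of the prompt convention above: each image is represented by
# 196 '@' characters, followed by a newline and the question text; multiple
# images are simply concatenated in this way.
IMAGE_PLACEHOLDER = "@" * 196  # one placeholder per image

def build_prompt(questions):
    """Interleave one image placeholder before each question."""
    return "\n".join(f"{IMAGE_PLACEHOLDER}\n{q}" for q in questions)

# Single image:
print(build_prompt(["这张图片是什么内容?"]))                # "What is in this image?"
# Two images:
print(build_prompt(["第一张图是什么?", "第二张图是什么?"]))  # first / second image questions
```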
Notes
- 24 GB of GPU memory is recommended (e.g., an RTX 3090). If memory is insufficient, reduce the batch size (batch_size).
- Make sure the dataset paths are correct: the *.jsonl files and image folders must be placed under ./dataset.
- Freezing the CLIP model during training reduces the compute requirements.
- Multi-image dialogues have limited effectiveness, and it is recommended to prioritize testing single-image scenarios.
Application Scenarios
- AI algorithm learning
MiniMind-V provides concise vision-language model code that helps students understand cross-modal modeling principles. Users can modify the code to experiment with different parameters or datasets.
- Rapid prototyping
Developers can prototype image-dialogue applications on top of MiniMind-V. It is lightweight and efficient, making it suitable for low-compute devices such as PCs or embedded systems.
- Education and training
Universities can use MiniMind-V in AI courses to demonstrate the full model-training workflow. The code is clearly commented and suitable for classroom practice.
- Low-cost experiments
Training costs are low, so teams with limited budgets can evaluate multimodal models without high-performance servers.
QA
- What image size does MiniMind-V support?
224×224 pixels by default, limited by the CLIP model. The dataset images may be compressed to 128×128 to save space. Higher-resolution CLIP models may be tried in the future.
- How long does training take?
On a single NVIDIA 3090, one epoch of pre-training takes about 1 hour; fine-tuning is somewhat faster. The exact time depends on the hardware and the amount of data.
- Can I fine-tune without pre-training?
Yes. Download the official pre-trained weights and run train_sft_vlm.py to fine-tune.
- What languages are supported?
Mainly Chinese and English; quality depends on the dataset. Users can extend to other languages through fine-tuning.
- How well does multi-image dialogue work?
Multi-image dialogue is currently limited; single-image scenarios are recommended first. It can be improved in the future with larger models and datasets.