MiniMind-V is an open-source project hosted on GitHub that helps users train a lightweight vision-language model (VLM) with only 26 million parameters in under an hour. It builds on the MiniMind language model and adds a visual encoder and a feature projection module to support joint processing of images and text. The project provides complete code from dataset cleaning to model inference, with a training cost as low as about 1.3 RMB on a single GPU (e.g., an NVIDIA 3090). MiniMind-V emphasizes simplicity and ease of use: it requires fewer than 50 lines of code changes relative to the base language model, which makes it a good tool for developers who want to experiment with and learn how vision-language models are built.

Function List
- Provides complete training code for a 26-million-parameter vision-language model, with fast training on a single GPU.
- Uses the CLIP visual encoder to turn a 224×224 pixel image into 196 visual tokens.
- Supports single- and multi-image input combined with text for dialog, image description, or Q&A.
- Includes end-to-end scripts for dataset cleaning, pre-training, and supervised fine-tuning (SFT).
- Implemented in native PyTorch, with multi-GPU acceleration and broad compatibility.
- Provides downloadable model weights on Hugging Face and ModelScope.
- Provides a web interface and command-line inference for quickly testing the model.
- Supports wandb for logging loss and performance during training.
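The 196-token figure in the list above follows from simple patch arithmetic. The sketch below assumes the encoder cuts a square image into non-overlapping square patches and that the patch size is 16 pixels, as the encoder name clip-vit-base-patch16 suggests; the function name is hypothetical and not part of the project.

```python
def num_visual_tokens(image_size: int = 224, patch_size: int = 16) -> int:
    """Count the non-overlapping square patches cut from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size  # 224 // 16 = 14
    return patches_per_side * patches_per_side


print(num_visual_tokens())  # 14 * 14 = 196
```

Each patch becomes one visual token, so a larger patch size or a smaller input image would yield fewer tokens.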
Usage Guide
The process of using MiniMind-V includes environment configuration, data preparation, model training and effect testing. Each step is described in detail below to help users get started quickly.
Environment Configuration
MiniMind-V requires a Python environment and GPU support. Here are the installation steps:
- Clone the code
 Run the following commands in the terminal to download the project code:
    git clone https://github.com/jingyaogong/minimind-v
    cd minimind-v
- Install dependencies
 The project provides a requirements.txt file listing the required libraries. Run the following command:
    pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
 Python 3.9 or above is recommended. If you have a GPU, make sure your PyTorch build supports CUDA. You can verify this by running:
    import torch
    print(torch.cuda.is_available())
 An output of True indicates that the GPU is available.
- Download the CLIP model
 MiniMind-V uses the CLIP model (clip-vit-base-patch16) as its visual encoder. Run the following command to download it into ./model/vision_model:
    git clone https://huggingface.co/openai/clip-vit-base-patch16 ./model/vision_model
 It is also available from ModelScope:
    git clone https://www.modelscope.cn/models/openai-mirror/clip-vit-base-patch16 ./model/vision_model
- Download the base language model weights
 MiniMind-V builds on the MiniMind language model, so the language model weights must be downloaded into the ./out directory. Example:
    wget https://huggingface.co/jingyaogong/MiniMind2-V-PyTorch/resolve/main/lm_512.pth -P ./out
 Alternatively, download lm_768.pth, depending on the model configuration.
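The setup steps above can be double-checked with a small preflight script. This is a minimal sketch, not part of the project: the function name and report keys are hypothetical, and it only checks the Python version, the two download locations named above, and (if PyTorch is installed) CUDA availability.

```python
import sys
from pathlib import Path


def preflight(root: str = ".") -> dict:
    """Return a name -> bool report for the environment setup steps."""
    base = Path(root)
    out_dir = base / "out"
    report = {
        "python_ok": sys.version_info >= (3, 9),
        # CLIP encoder cloned into ./model/vision_model
        "clip_encoder": (base / "model" / "vision_model").is_dir(),
        # Base language model weights such as lm_512.pth or lm_768.pth in ./out
        "lm_weights": out_dir.is_dir() and any(out_dir.glob("lm_*.pth")),
    }
    try:
        import torch  # imported lazily so the check still runs without PyTorch
        report["torch"] = True
        report["cuda"] = bool(torch.cuda.is_available())
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
    return report


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"[{'ok' if ok else 'MISSING'}] {name}")
```

Run it from the project root; any line marked MISSING points back to the corresponding step above.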
Data Preparation
MiniMind-V uses about 570,000 pre-training images and about 300,000 instruction fine-tuning samples, which together take about 5 GB of storage. The procedure is as follows:
- Create the dataset directory
 In the project root directory, create the ./dataset folder:
    mkdir dataset
- Download the dataset
 Download the dataset from Hugging Face or ModelScope. It contains the *.jsonl Q&A data and the *images image data:
 - Hugging Face: https://huggingface.co/datasets/jingyaogong/minimind-v_dataset
 - ModelScope: https://www.modelscope.cn/datasets/gongjy/minimind-v_dataset
 Then unzip the image data into ./dataset:
    unzip pretrain_images.zip -d ./dataset
    unzip sft_images.zip -d ./dataset
- Verify the dataset
 Make sure ./dataset contains the following files:
 - pretrain_vlm_data.jsonl: pre-training data, approximately 570,000 entries.
 - sft_vlm_data.jsonl: single-image fine-tuning data, approximately 300,000 entries.
 - sft_vlm_data_multi.jsonl: multi-image fine-tuning data, approximately 13,600 entries.
 - Image folders: the image files for pre-training and fine-tuning.
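The file checklist above can be verified with a short script. The sketch below is an illustration rather than project code: the function names are hypothetical, and it assumes each .jsonl file holds one record per non-empty line. It reports the entry count for each expected file so the numbers can be compared with the approximate figures listed above.

```python
from pathlib import Path

# File names taken from the checklist above.
EXPECTED_JSONL = (
    "pretrain_vlm_data.jsonl",
    "sft_vlm_data.jsonl",
    "sft_vlm_data_multi.jsonl",
)


def count_entries(path: Path) -> int:
    """Count non-empty lines, assuming one JSON record per line."""
    with path.open("r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_dataset(dataset_dir: str = "./dataset") -> dict:
    """Map each expected file name to its entry count, or None if missing."""
    root = Path(dataset_dir)
    report = {}
    for name in EXPECTED_JSONL:
        path = root / name
        report[name] = count_entries(path) if path.is_file() else None
    return report


if __name__ == "__main__":
    for name, n in check_dataset().items():
        status = "MISSING" if n is None else f"{n} entries"
        print(f"{name}: {status}")
```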
 
Model Training
MiniMind-V training is divided into pre-training and supervised fine-tuning, and supports single- or multi-GPU acceleration.
- Configure parameters
 Edit ./model/LMConfig.py to set the model parameters. Example:
 - Small model: dim=512, n_layers=8
 - Medium model: dim=768, n_layers=16
 These parameters determine the model size and performance.
- Pre-training
 Run the pre-training script to learn image description capabilities:
    python train_pretrain_vlm.py --epochs 4
 The output weights are saved as ./out/pretrain_vlm_512.pth (or the 768 counterpart, depending on the model configuration). On a single NVIDIA 3090, one epoch takes about 1 hour. This stage freezes the CLIP model and trains only the projection layer and the last layer of the language model.
- Supervised fine-tuning (SFT)
 Fine-tune from the pre-trained weights to improve conversational ability:
    python train_sft_vlm.py --epochs 4
 The output weights are saved as ./out/sft_vlm_512.pth. This stage trains all parameters of the projection layer and the language model.
- Multi-GPU training (optional)
 If you have N GPUs, use the following command to speed up training:
    torchrun --nproc_per_node N train_pretrain_vlm.py --epochs 4
 Replace train_pretrain_vlm.py with another training script (e.g., train_sft_vlm.py) as needed.
- Monitor training
 Training loss can be logged with wandb:
    python train_pretrain_vlm.py --epochs 4 --use_wandb
 View the live data on the official wandb website.
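The configuration step above ties model size to dim and n_layers. One rough way to see how is the common rule of thumb that a plain decoder-only transformer block holds about 12 × dim² weights. The sketch below applies that rule to the two presets; the function name is hypothetical, and MiniMind's actual layer layout may differ, so the results are an order-of-magnitude guide rather than the project's real parameter counts.

```python
def approx_block_params(dim: int, n_layers: int) -> int:
    """Rule-of-thumb weight count for a plain decoder-only transformer.

    Assumes roughly 4*dim^2 attention weights and 8*dim^2 feed-forward
    weights per block (hidden size 4*dim). Embeddings, norms, and the
    vision projection layer are ignored.
    """
    return 12 * n_layers * dim * dim


# The two presets named in the configuration step.
for name, dim, n_layers in [("small", 512, 8), ("medium", 768, 16)]:
    millions = approx_block_params(dim, n_layers) / 1e6
    print(f"{name}: dim={dim}, n_layers={n_layers} -> ~{millions:.1f}M block weights")
```

For the small preset the estimate comes out at roughly 25 million, the same order as the 26 million parameters quoted for the model. Doubling n_layers doubles the estimate, while increasing dim grows it quadratically.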
Testing the Model
Once training is complete, the model can be tested for image dialog capabilities.
- Command-line inference
 Run the following command to load the model:
    python eval_vlm.py --load 1 --model_mode 1
 - --load 1: load the transformers-format model from Hugging Face.
 - --load 0: load PyTorch weights from ./out.
 - --model_mode 1: test the fine-tuned model; --model_mode 0 tests the pre-trained model.
 
- Web interface testing
 Launch the web interface:
    python web_demo_vlm.py
 Then visit http://localhost:8000, upload an image, and enter text to test.
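- Input format
 MiniMind-V uses 196 placeholder tokens, written @@@...@@@, to represent one image. Example:
    @@@...@@@\n这张图片是什么内容?
 (The Chinese text asks "What is in this picture?")
 Example of multi-image input:
    @@@...@@@\n第一张图是什么?\n@@@...@@@\n第二张图是什么?
 (The Chinese text asks "What is the first picture?" and "What is the second picture?")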
- Download pre-trained weights
 If you do not want to train the model yourself, you can download the official weights directly:
 - PyTorch format: https://huggingface.co/jingyaogong/MiniMind2-V-PyTorch
 - Transformers format: https://huggingface.co/collections/jingyaogong/minimind-v-67000833fb60b3a2e1f3597d
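The input format described above can be assembled programmatically. The sketch below is a hedged illustration: it assumes each image is represented by a run of 196 "@" characters (one per visual token) followed by a newline and the question, matching the pattern shown in the examples. The constant and function names are hypothetical and are not taken from the project code.

```python
from typing import List

IMAGE_PLACEHOLDER_CHAR = "@"  # assumption: one "@" per visual token
TOKENS_PER_IMAGE = 196        # from the text: 196 placeholders per image


def image_placeholder() -> str:
    """Return the placeholder run that stands in for one image."""
    return IMAGE_PLACEHOLDER_CHAR * TOKENS_PER_IMAGE


def build_prompt(questions: List[str]) -> str:
    """Pair one placeholder run with each question, joined by newlines."""
    return "\n".join(f"{image_placeholder()}\n{q}" for q in questions)


single = build_prompt(["What is in this picture?"])
multi = build_prompt(["What is the first picture?", "What is the second picture?"])
print(len(single.split("\n")[0]))  # length of the placeholder run
print(multi.count(IMAGE_PLACEHOLDER_CHAR))  # total placeholders for two images
```

Passing one question produces the single-image form; passing several produces the multi-image form with one placeholder run per image.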
 
Notes
- 24 GB of video memory is recommended (e.g., RTX 3090). If video memory is insufficient, reduce the batch size (batch_size).
- Make sure the dataset paths are correct: the *.jsonl and image files must be placed in the ./dataset directory.
- Freezing the CLIP model during training reduces compute requirements.
- Multi-image dialog has limited effectiveness, so testing single-image scenarios first is recommended.
Application Scenarios
- Learning AI algorithms
 MiniMind-V provides concise vision-language model code that helps students understand cross-modal modeling principles. Users can modify the code to experiment with different parameters or datasets.
- Rapid prototyping
 Developers can prototype image dialog applications on top of MiniMind-V. It is lightweight and efficient, and suits low-compute devices such as PCs or embedded systems.
- Education and training
 Colleges and universities can use MiniMind-V in AI courses to demonstrate the whole model training process. The code is clearly commented and suitable for classroom practice.
- Low-cost experiments
 Training costs are low, so teams with limited budgets can test multimodal models without high-performance servers.
FAQ
- What image sizes does MiniMind-V support?
 By default, images are processed at 224×224 pixels, a limit set by the CLIP model. Dataset images may be compressed to 128×128 to save space. CLIP models with higher resolution may be tried in the future.
- How long does training take?
 On a single NVIDIA 3090, one epoch of pre-training takes about 1 hour, and fine-tuning is a bit faster. The exact time varies with the hardware and the amount of data.
- Can I fine-tune without pre-training?
 Yes. Download the official pre-trained weights and run train_sft_vlm.py to fine-tune.
- What languages are supported?
 Mainly Chinese and English; quality depends on the dataset. Users can add other languages through fine-tuning.
- How well does multi-image dialog work?
 Multi-image dialog capability is currently limited, so single-image scenarios should be prioritized. Larger models and datasets may improve it in the future.
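The training-time answer above can be turned into a quick budget estimate. The sketch below is illustrative only: the function name is hypothetical, and the default rental rate of 1.3 RMB per GPU-hour is an assumption, chosen so that one 1-hour epoch matches the "about 1.3 RMB" figure quoted in the introduction. Substitute your own hardware timings and prices.

```python
def estimate_training(epochs: int,
                      hours_per_epoch: float = 1.0,
                      rmb_per_gpu_hour: float = 1.3) -> tuple:
    """Return (total_hours, total_cost_rmb) for a single-GPU run.

    hours_per_epoch defaults to the ~1 hour per pre-training epoch quoted
    for an NVIDIA 3090; rmb_per_gpu_hour is an assumed rental price.
    """
    if epochs < 0:
        raise ValueError("epochs must be non-negative")
    total_hours = epochs * hours_per_epoch
    return total_hours, total_hours * rmb_per_gpu_hour


# Example: the --epochs 4 setting used in the training commands.
hours, cost = estimate_training(4)
print(f"{hours:.1f} GPU-hours, about {cost:.1f} RMB")
```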