Unsloth currently supports the following mainstream vision language models:
- Llama 3.2 Vision (11B parameters)
- Qwen 2.5 VL (7B parameters)
- Pixtral (12B parameters)
The typical workflow for vision tasks includes:
- Dedicated model loading: Unlike plain-text LLMs, these models are loaded through a vision-specific class (Unsloth's FastVisionModel.from_pretrained) rather than the usual text-only loader (see the loading sketch after this list)
- Multimodal data processing: Datasets must pair each image with its text annotation, usually expressed as chat-style messages (see the data-formatting sketch after this list)
- Joint training configuration: When attaching LoRA adapters you choose which parts of the model to fine-tune, e.g. the vision layers, the language layers, or both (see the training sketch after this list)
- Task-specific fine-tuning: Supports a wide range of tasks such as image caption generation, visual question answering (VQA), image-text matching, and more (an inference sketch follows the training example below)
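
As a concrete illustration of the "dedicated model loading" and "joint training configuration" steps, the sketch below follows Unsloth's documented vision workflow: FastVisionModel.from_pretrained loads the model and FastVisionModel.get_peft_model attaches LoRA adapters, with flags that control whether the vision layers, the language layers, or both are trained. The exact checkpoint name, 4-bit setting, and LoRA hyperparameters are illustrative choices, not values prescribed by the article.

```python
from unsloth import FastVisionModel

# Load a vision language model with Unsloth's vision-specific loader.
# The checkpoint name is illustrative; Qwen 2.5 VL or Pixtral checkpoints
# are loaded the same way.
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit = True,                      # quantized loading to save VRAM
    use_gradient_checkpointing = "unsloth",   # memory-efficient backprop
)

# Attach LoRA adapters and choose which parts of the model to fine-tune
# (this is the "joint training configuration" step).
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers   = True,   # train the image encoder
    finetune_language_layers = True,   # train the language backbone
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
)
```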
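The multimodal data step amounts to converting each (image, text) pair into a chat-style message list and handing the converted dataset to a trainer together with Unsloth's vision data collator. The sketch below continues from the model and tokenizer loaded above; the dataset name, column names, and training hyperparameters are placeholders, and the trainer setup follows the pattern used in Unsloth's vision fine-tuning notebooks rather than anything stated in the article.

```python
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Hypothetical dataset: any dataset with an image column and a text column works.
dataset = load_dataset("your-org/your-captioning-dataset", split = "train")

def convert_to_conversation(sample):
    """Pair each image with its text annotation in chat-message format."""
    messages = [
        {"role": "user", "content": [
            {"type": "text",  "text": "Describe this image."},
            {"type": "image", "image": sample["image"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["caption"]},   # hypothetical column name
        ]},
    ]
    return {"messages": messages}

converted_dataset = [convert_to_conversation(s) for s in dataset]

FastVisionModel.for_training(model)   # switch the model into training mode

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer),  # batches images + text
    train_dataset = converted_dataset,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        optim = "adamw_8bit",
        output_dir = "outputs",
        # Keep the raw image fields intact for the vision collator.
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_seq_length = 2048,
    ),
)
trainer.train()
```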
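Once fine-tuned, the same model handles tasks such as caption generation or VQA at inference time by pairing an image with a text prompt. This is a minimal sketch continuing from the objects above; the image path and prompt are placeholders.

```python
from PIL import Image

FastVisionModel.for_inference(model)   # enable Unsloth's inference mode

image = Image.open("example.jpg")      # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this picture?"},  # VQA-style prompt
    ]},
]

# Build the prompt with the chat template, then process image and text together.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(image, prompt, add_special_tokens = False, return_tensors = "pt").to("cuda")

output = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
print(tokenizer.decode(output[0], skip_special_tokens = True))
```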
These vision models are particularly suited to cross-modal scenarios that combine image understanding with text generation, such as smart photo-album management and medical image analysis.
This answer comes from the article "Unsloth: An Open-Source Tool for Efficiently Fine-Tuning and Training Large Language Models".