
Hugging Face Introduces SmolVLM, a Small Multimodal Model that Runs on End Devices

2024-12-01

SmolVLM is a small multimodal model with a parameter count of 2 billion that accepts input from any combination of images and text and generates textual output.


After launching the SmolLM lightweight language model in July, AI development platform Hugging Face this week released SmolVLM, a lightweight, high-performance multimodal model, adding to its lineup of small models.

SmolVLM is a small multimodal model with 2 billion parameters and is described as the performance leader in its class (state-of-the-art, SOTA). It accepts any combination of images and text as input but, as a lightweight model, generates only textual output. SmolVLM can answer questions about an image, describe an image's content, tell a story based on multiple images, or be used as a pure language model. According to the development team, SmolVLM's lightweight architecture makes it well suited to running on end devices while still performing multimodal tasks well.
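To make the image-question-answering use case concrete, here is a minimal sketch of querying the instruction-tuned model through the `transformers` library. The model ID (`HuggingFaceTB/SmolVLM-Instruct`) and the chat-message layout follow Hugging Face's published SmolVLM examples; treat the exact details as assumptions and check the model card for the release you use.

```python
# Hedged sketch: asking SmolVLM-Instruct a question about a local image.
# Model ID and message format are taken from Hugging Face's SmolVLM
# examples and may differ across releases.

def build_messages(question: str) -> list:
    # SmolVLM uses a chat template whose content mixes image and text parts.
    return [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }]

def ask(image_path: str, question: str, max_new_tokens: int = 128) -> str:
    # Imports kept local so the message builder works without heavy deps.
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    model_id = "HuggingFaceTB/SmolVLM-Instruct"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(model_id)

    prompt = processor.apply_chat_template(
        build_messages(question), add_generation_prompt=True
    )
    inputs = processor(text=prompt, images=[Image.open(image_path)],
                       return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(out, skip_special_tokens=True)[0]
```

Passing multiple images works the same way: add one `{"type": "image"}` part per image to the message and hand the processor a matching list of images.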

SmolVLM's architecture is based on Hugging Face's previously introduced vision-language model IDEFICS 3, and it even shares the same Transformers implementation. However, Hugging Face made several improvements over IDEFICS. First, the core language model was switched from Llama 3.1 8B to SmolLM2 1.7B. Second, SmolVLM compresses visual information more aggressively, using a pixel shuffle strategy and larger patches for visual token encoding, which improves encoding efficiency, speeds up inference, and reduces memory usage.

Hugging Face emphasized SmolVLM's efficiency and low memory usage, publishing benchmark results against models of comparable size. SmolVLM outperforms models such as InternVL2, PaliGemma, MM1.5, moondream, and MiniCPM-V-2 in multimodal comprehension, reasoning, math, and text comprehension, and it also beats most of them in GPU memory efficiency. Compared with Alibaba's Qwen2-VL, SmolVLM delivers 3.3 to 4.5 times faster prefill throughput and 7.5 to 16 times higher generation throughput.

Hugging Face has released three versions of the SmolVLM family: SmolVLM-Base for fine-tuning, SmolVLM-Synthetic fine-tuned on synthetic datasets, and the instruction-tuned SmolVLM-Instruct, which is ready for direct end-user interaction. All SmolVLM model checkpoints, training datasets, training recipes, and tools are released under the Apache 2.0 open-source license.

