Gemma 3n
Google is expanding its push for accessible AI with the release of Gemma 3 and Gemma 3 QAT, open source models that run on a single cloud or desktop accelerator. If Gemma 3 brought powerful cloud and desktop capabilities to developers, this May 20, 2025 release...
MoviiGen 1.1
MoviiGen 1.1 is an open source AI tool developed by ZuluVision that focuses on generating high-quality video from text. It supports 720P and 1080P resolutions and is especially suited to professional video production that demands cinematic visual effects. Users can generate videos from simple text descriptions with natural dynamic...
HiDream-I1
HiDream-I1 is an open source image generation foundation model with 17 billion parameters that quickly generates high-quality images. Users only need to enter a text description, and the model can generate images in a variety of styles, including realistic, cartoon, and artistic. Developed by the HiDream.ai team and hosted on GitHub, the project picks...
Imagen 4
Google DeepMind's recently launched Imagen 4 model, the latest iteration of its image generation technology, is quickly becoming an industry focal point. The model has made significant progress in improving the richness, accuracy of detail, and speed of image generation, working to bring the user's imagination to life in ways never before...
BAGEL
BAGEL is an open source multimodal foundation model developed by the ByteDance Seed team and hosted on GitHub. It integrates text comprehension, image generation, and editing capabilities to support cross-modal tasks. The model has 7B active parameters (14B parameters in total) and uses Mixture-of-Tra...
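The "7B active / 14B total" split above comes from sparse expert routing: each token is processed by only a few selected experts, so far fewer parameters are exercised per token than the model contains. A toy sketch of top-k gated routing in this spirit (hypothetical shapes and names, not BAGEL's actual Mixture-of-Transformers code):

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, gate_w, top_k=2):
    """Route one token vector through only the top-k experts.

    Only the selected experts run, which is why the "active"
    parameter count per token is a fraction of the total.
    """
    logits = x @ gate_w                      # gate scores, one per expert
    top = np.argsort(logits)[-top_k:]        # indices of the chosen experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected gates
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

d, num_experts = 8, 4
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]  # toy expert weights
gate_w = rng.normal(size=(d, num_experts))
x = rng.normal(size=d)                       # one token embedding
y = moe_forward(x, experts, gate_w)          # same shape as the input token
```

With `top_k=2` of 4 experts, only half the expert parameters touch any given token, mirroring the active-vs-total distinction in the blurb.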
MiniMax Speech 02
With the continuous evolution of AI technology, personalized and highly natural voice interaction has become a key requirement for many intelligent applications. However, existing text-to-speech (TTS) systems still fall short on large-scale voice personalization, multilingual coverage, and highly realistic emotional expression. To address these limitations...
Windsurf SWE-1
The much-anticipated SWE-1 family of models, billed as a new generation of cutting-edge models for software engineering, was recently released. Designed to optimize the entire software engineering process, this family goes far beyond the traditional task of writing code. The SWE-1 family currently consists of three well-positioned models:...
VideoMind
VideoMind is an open source multimodal AI tool focused on reasoning, question answering, and summarization over long videos. It was developed by Ye Liu of The Hong Kong Polytechnic University together with a team from Show Lab at the National University of Singapore. The tool mimics the way humans understand video by splitting the task into planning, localization, checking...
MoshiVis
MoshiVis is an open source project developed by Kyutai Labs and hosted on GitHub. It builds on the Moshi speech-text model (7B parameters), adding roughly 206 million adaptation parameters and a frozen PaliGemma2 vision encoder (400M parameters), allowing the model...
Qwen2.5-Omni
Qwen2.5-Omni is an open source multimodal AI model developed by Alibaba Cloud Qwen team. It can process multiple inputs such as text, images, audio, and video, and generate text or natural speech responses in real-time. The model was released on March 26, 2025, and the code and model files are hosted on GitH....
StarVector
StarVector is an open source project created by Juan A. Rodriguez and collaborators to convert images and text into Scalable Vector Graphics (SVG). The tool uses a vision-language model that understands image content and text instructions to generate high-quality SVG code. Its core ...
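What makes SVG a natural target for a language model is that it is plain XML text, so a model can emit it token by token like any other code. A minimal hand-written example of the kind of markup such a model produces (a hypothetical helper for illustration, not StarVector's API):

```python
def circle_svg(cx, cy, r, color="black"):
    """Return a minimal standalone SVG document containing one circle."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
        f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="{color}"/>'
        f'</svg>'
    )

# A red disc centered in a 100x100 canvas; the string can be saved
# as a .svg file and opened directly in a browser.
svg = circle_svg(50, 50, 40, "red")
```

Because the output is structured text like this, standard code-generation training and decoding techniques carry over directly to the image-to-SVG task.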
LaWGPT
LaWGPT is an open source project supported by the Machine Learning and Data Mining research group at Nanjing University, dedicated to building large language models grounded in Chinese legal knowledge. It extends general-purpose Chinese base models (such as Chinese-LLaMA and ChatGLM) with a legal-domain vocabulary, and through large-scale...
Baichuan-Audio
Baichuan-Audio is an open source project developed by Baichuan Intelligence (baichuan-inc), hosted on GitHub, focusing on end-to-end voice interaction technology. The project provides a complete audio processing framework that converts speech input into discrete audio tokens, which are then fed to a large model to generate the corresponding ...
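To make "discrete audio tokens" concrete, here is a toy uniform quantizer that maps waveform samples to integer ids and back. This is only a stand-in illustration of the idea; Baichuan-Audio uses a learned neural audio tokenizer, not this scheme:

```python
import numpy as np

def tokenize_audio(samples, codebook_size=256):
    """Map continuous samples in [-1, 1] to integer token ids
    by uniform quantization (a crude stand-in for a learned codec)."""
    clipped = np.clip(samples, -1.0, 1.0)
    return np.round((clipped + 1.0) / 2.0 * (codebook_size - 1)).astype(int)

def detokenize_audio(tokens, codebook_size=256):
    """Invert the mapping back to approximate sample values."""
    return tokens / (codebook_size - 1) * 2.0 - 1.0

t = np.linspace(0.0, 1.0, 16000)
wave = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone
tokens = tokenize_audio(wave)        # integer ids a language model can consume
recon = detokenize_audio(tokens)     # lossy reconstruction of the waveform
```

Once audio is a sequence of integer ids like `tokens`, a large language model can consume and produce it with the same machinery it uses for text, which is the core idea behind end-to-end speech interaction frameworks.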
Step-Audio
Step-Audio is an open source intelligent speech interaction framework designed to provide out-of-the-box speech understanding and generation capabilities for production environments. The framework supports multilingual dialogue (e.g., Chinese, English, Japanese), emotional speech (e.g., happy, sad), regional dialects (e.g., Cantonese, Sichuanese), adjustable speech rate...
DeepSeek-VL2
DeepSeek-VL2 is a series of advanced Mixture-of-Experts (MoE) vision-language models that significantly improve on the performance of their predecessor, DeepSeek-VL. The models excel at tasks such as visual question answering, optical character recognition, document/table/chart comprehension, and visual grounding. De...
VITA
VITA is a leading open source interactive multimodal large language model project. With the launch of VITA-1.0 in August 2024, it pioneered the first open source interactive omni-modal large language model. In December 2024, the project launched a major upgrade...
AnyText
AnyText is a multilingual visual text generation and editing tool built on a diffusion model. It generates natural, high-quality multilingual text within images and supports flexible text editing. Developed by a team of researchers, it received a Spotlight at the ICLR 2024 conference...
Megrez-3B-Omni
Infini-Megrez is an edge intelligence solution developed by Infinigence AI, aiming to achieve efficient multimodal understanding and analysis through hardware-software co-design. At the core of the project is the Megrez-3B model, which supports integrated image, text, and audio understanding with high accuracy and fast...
OmniGen
OmniGen is a "universal" image generation model developed by VectorSpaceLab that lets users create diverse, contextually rich visuals from simple text prompts or multimodal inputs. It is particularly well suited to scenarios that require identity recognition and consistent character rendering. Users...