OmniInsert: A tool for inserting any reference image into video without masking
OmniInsert is a research project developed by ByteDance's Intelligent Creation Lab. It seamlessly inserts any reference object into a video without requiring a mask. In a traditional video editing workflow, adding a new object to a video usually requires manually creating a...
Wan2.2-S2V-14B: Video Generation Model for Speech-Driven Character Lip Synchronization
Wan2.2-S2V-14B is a large-scale AI model developed by the Wan-AI team, specialized in generating high-quality videos from audio, text, and images. It adopts an innovative Mixture-of-Experts (MoE) architecture with 27B total parameters, of which only 14B are activated at runtime, effectively balancing performance and...
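To see why a Mixture-of-Experts model activates only a fraction of its total parameters per step, here is a minimal, generic routing sketch (illustrative only; not the Wan-AI implementation, and all module names are invented for the example):

```python
# Minimal Mixture-of-Experts sketch: a router picks the top-k experts per token,
# so only those experts' parameters participate in a given forward pass.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, dim)
        scores = self.router(x)                # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask][:, k:k + 1] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(4, 64)).shape)           # torch.Size([4, 64])
```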
SpatialLM: Scan a room and AI automatically builds a 3D model for you
SpatialLM is a large language model designed specifically for processing three-dimensional (3D) point cloud data. Its core function is to understand unstructured 3D geometric data and transform it into structured 3D scene representations. These structured outputs contain architectural elements (e.g., walls, doors, windows) as well as objects with orientation...
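As a rough illustration of what such a structured scene representation could look like (the field names below are hypothetical and not SpatialLM's actual output schema):

```python
# Illustrative data structures for a structured 3D scene description
# (hypothetical fields; not SpatialLM's real output format).
from dataclasses import dataclass

@dataclass
class Wall:
    start_xy: tuple      # (x, y) of one end, in meters
    end_xy: tuple        # (x, y) of the other end
    height: float

@dataclass
class OrientedObject:
    category: str        # e.g. "sofa", "table"
    center_xyz: tuple    # 3D center of the bounding box
    size_xyz: tuple      # width, depth, height
    yaw: float           # rotation around the vertical axis, in radians

scene = {
    "walls": [Wall((0.0, 0.0), (4.2, 0.0), 2.8)],
    "objects": [OrientedObject("sofa", (1.5, 2.0, 0.4), (2.0, 0.9, 0.8), 1.57)],
}
print(scene["objects"][0].category)
```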
VibeVoice-1.5B: A Microsoft Speech Generation Model Supporting Long-Form, Multi-Speaker Conversational Audio
VibeVoice-1.5B is a cutting-edge open-source Text-to-Speech (TTS) model released by Microsoft Research. It is specifically designed for generating expressive, long-form, multi-character dialog audio, such as podcasts or audiobooks. The core innovation of VibeVoice is its use of a 7...
Grok-2: xAI's Open Source Mixture-of-Experts Large Language Model
Grok-2 is a second-generation large language model developed by Elon Musk's xAI in 2024. A key feature of the model is its Mixture-of-Experts (MoE) architecture, which is designed to process information more efficiently. Simply put, there are multiple "experts" within the model...
Baichuan-M2: A Large Language Model with Enhanced Medical Reasoning
Baichuan-M2 is an open source large language model with 32 billion (32B) parameters from Baichuan Intelligence. The model focuses on the medical domain and is designed to handle real-world medical reasoning tasks. It is built on the Qwen2.5-32B model and was developed by introducing an innovative "Large Verifier System" (L...
Genie 3: Generating virtual worlds you can interact with in real time
Genie 3 is a general-purpose world model released by Google DeepMind that represents the latest advance in AI for simulating and creating virtual environments. The model's most distinctive capability is generating diverse, dynamic worlds that support real-time interaction from nothing more than a textual description...
Seed-OSS: Open Source Large Language Model for Long Context Reasoning and Versatile Applications
Seed-OSS is a series of open source large language models developed by the Seed team at ByteDance, focusing on long-context processing, reasoning capabilities, and agent task optimization. The models contain 36 billion parameters, were trained on only 12 trillion tokens, perform strongly on multiple mainstream benchmarks, and support...
HRM: Hierarchical Reasoning Model for Complex Reasoning
HRM (Hierarchical Reasoning Model) is a hierarchical reasoning model with only 27 million parameters, designed to solve complex reasoning tasks in the field of artificial intelligence. The design of the model is inspired by the hierarchical, multi-timescale information processing of the human brain. It is implemented through a high-level module (responsible for...
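To make the "multi-timescale" idea concrete, here is a loose conceptual sketch (not the published HRM architecture): a high-level recurrent state that updates only every few steps while a low-level state updates at every step.

```python
# Toy illustration of two modules running at different timescales
# (conceptual only; not the actual HRM implementation).
import torch
import torch.nn as nn

class TwoTimescale(nn.Module):
    def __init__(self, dim=32, slow_period=4):
        super().__init__()
        self.fast = nn.GRUCell(dim, dim)   # low-level module: updates every step
        self.slow = nn.GRUCell(dim, dim)   # high-level module: updates every `slow_period` steps
        self.slow_period = slow_period

    def forward(self, inputs):             # inputs: (steps, batch, dim)
        batch, dim = inputs.shape[1], inputs.shape[2]
        h_fast = torch.zeros(batch, dim)
        h_slow = torch.zeros(batch, dim)
        for t, x in enumerate(inputs):
            h_fast = self.fast(x + h_slow, h_fast)   # low level, conditioned on high level
            if (t + 1) % self.slow_period == 0:
                h_slow = self.slow(h_fast, h_slow)   # high level summarizes periodically
        return h_slow

model = TwoTimescale()
print(model(torch.randn(8, 2, 32)).shape)            # torch.Size([2, 32])
```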
DeepSeek-V3.1-Base: a large-scale language model for efficiently processing complex tasks
DeepSeek-V3.1-Base is an open source large language model developed by DeepSeek and released on the Hugging Face platform, designed for natural language processing tasks. It has 685 billion parameters, supports multiple data types (BF16, F8_E4M3, F32), and can...
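For context, a Hugging Face checkpoint like this is typically loaded with an explicit dtype. The sketch below is hedged: the repo ID is assumed from the project name, the model is far too large for a single GPU, and FP8 (F8_E4M3) weights require compatible hardware and kernels.

```python
# Hedged sketch of loading a causal LM in BF16 with the transformers library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V3.1-Base"   # assumed repo ID; verify on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # BF16 here; FP8 weights need dedicated support
    device_map="auto",            # requires `accelerate` and multi-GPU/offload for a model this size
)
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```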
Qwen-Image-Edit: an AI model for editing images based on textual commands
Qwen-Image-Edit is an image editing AI model developed by Alibaba's Tongyi Qianwen (Qwen) team. It is trained on top of the 20-billion-parameter Qwen-Image model, and its core function is to allow users to modify images through simple Chinese or English text commands. The model also utilizes visual...
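A hedged sketch of what text-command editing looks like in practice, assuming a diffusers-style pipeline; the pipeline class name and repo ID below are assumptions and should be checked against the official model card.

```python
# Hedged sketch of text-command image editing with a diffusers-style pipeline.
# `QwenImageEditPipeline` and the repo ID are assumed; verify before use.
import torch
from diffusers import QwenImageEditPipeline   # assumed class name
from PIL import Image

pipe = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16   # assumed repo ID
).to("cuda")

source = Image.open("photo.jpg")
edited = pipe(image=source, prompt="Change the sky to a sunset").images[0]
edited.save("photo_sunset.jpg")
```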
GLM-4.5V: A multimodal dialog model capable of understanding images and videos and generating code
GLM-4.5V is a new-generation vision-language model (VLM) developed by Zhipu AI (Z.AI). The model is built on the flagship text model GLM-4.5-Air using an MoE architecture, with 106 billion total parameters, including 12 billion activated parameters. GLM-4.5V not only processes images and text, but also understands visual...
Qwen-Image: an AI tool for generating high-fidelity images with accurate text rendering
Qwen-Image is a 20B-parameter multimodal diffusion transformer (MMDiT) model developed by the Qwen team, focusing on high-fidelity image generation and accurate text rendering. It excels in complex text processing (especially Chinese and English) and image editing. The model supports a variety of art styles such as realistic,...
MiniMax Releases Speech 2.5: Speech Synthesis Technology Breaks Through in Multilingual Support and Timbre Reproduction
On August 7, MiniMax announced Speech 2.5, a next-generation speech generation model that, according to official data, improves on its predecessor Speech 02 in multilingual expressiveness, timbre reproduction accuracy, and the number of supported languages. In the field of Artificial Intelligence Generated Content (AIGC)...
KittenTTS: Lightweight Text-to-Speech Modeling
KittenTTS is an open source text-to-speech (TTS) model focused on being lightweight and efficient. It takes up less than 25MB of storage, has about 15 million parameters, and runs on low-end devices without GPU support. Developed by the KittenML team, KittenTTS offers multiple...
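A sketch of CPU-only usage in the style of the project's README; the package import, checkpoint name, and sample rate below are assumptions to be verified against the KittenML repository.

```python
# Illustrative CPU-only TTS usage (names assumed; check the KittenTTS repo).
from kittentts import KittenTTS   # assumed package/import
import soundfile as sf

model = KittenTTS("KittenML/kitten-tts-nano-0.1")   # assumed checkpoint name
audio = model.generate("A tiny text-to-speech model running without a GPU.")
sf.write("output.wav", audio, 24000)                # sample rate assumed
```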
GPT-OSS: OpenAI's Open Source Large Models for Efficient Reasoning
GPT-OSS is a family of open source language models from OpenAI, including gpt-oss-120b and gpt-oss-20b, with 117 billion and 21 billion parameters, respectively, licensed under the Apache 2.0 license, which allows developers to download, modify, and deploy them free of charge. gpt-oss...
SongGeneration: open-source AI model for generating high-quality music and lyrics
SongGeneration is a music generation model developed and open-sourced by Tencent AI Lab, focusing on generating high-quality songs, including lyrics, accompaniment, and vocals. It is based on the LeVo framework, combining the LeLM language model with music codecs to support song generation in English and Chinese. The model is trained on a dataset of millions of songs...
Step3: An open source large model for efficient multimodal content generation
Step3 is an open source multimodal large-model project developed by StepFun and hosted on GitHub, which aims to provide efficient and cost-effective text, image, and speech content generation capabilities. The project is centered on a 321-billion-parameter Mixture-of-Experts (MoE) model (38 billion active parameters) that optimizes the speed of inference...
Seed Diffusion: Validating High-Speed Language Models for Next-Generation Architectures
Seed Diffusion is an experimental language model launched by the ByteDance Seed team in conjunction with the Institute for AI Industry Research (AIR) at Tsinghua University; this website is a technology demonstration platform for the model. The model is based on discrete diffusion techniques, and its main goal is to explore an underlying framework for next-generation language models that can be...