
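xAI Grok Imagine API: A Production-Ready, Out-of-the-Box Multimodal Audio and Video Generation Service
In January 2026, xAI officially launched the Grok Imagine API, a production-grade multimodal video generation service for developers and enterprises. The service is built on xAI's in-house “Aurora” model, and its core capability lies in being able to take text...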

DeepSeek-OCR: An Open Source Optical Character Recognition (OCR) Tool
DeepSeek-OCR is an optical character recognition (OCR) tool developed and open-sourced by DeepSeek-AI. It proposes a new approach called “Contexts Optical Compression”, which rethinks the role of the vision encoder from the perspective of the large language model (LLM). The tool does not simply recognize graphs...

OmniInsert: A tool for inserting any reference image into video without masking
OmniInsert is a research project developed by ByteDance Intelligent Creation Lab. It is a tool that seamlessly inserts any reference object into a video without the use of a mask. In the traditional video editing process, if you want to add a new object to the video, you usually need to manually create a precise “mask” to frame out the...

Wan2.2-S2V-14B: Video Generation Model for Speech-Driven Character Lip Synchronization
Wan2.2-S2V-14B is a large-scale AI model developed by the Wan-AI team, specialized in generating high-quality videos from audio, text and images. It adopts a Mixture-of-Experts (MoE) architecture with 27B total parameters, of which only 14B are activated at runtime, balancing performance against computational cost. ...

SpatialLM: Scan a room and AI automatically draws the 3D model for you
SpatialLM is a large language model designed specifically for processing three-dimensional (3D) point cloud data. Its core function is to understand unstructured 3D geometric data and transform it into structured 3D scene representations. These structured outputs contain architectural elements (e.g., walls, doors, windows) as well as object bounding boxes with orientation and their semantic categories. Unlike many of the needs ...

VibeVoice-1.5B: A Speech Generation Model from Microsoft Supporting Long-Form, Multi-Speaker Conversations
VibeVoice-1.5B is an open-source Text-to-Speech (TTS) model released by Microsoft Research. It is specifically designed for generating expressive, long-form, multi-speaker dialogue audio, such as podcasts or audiobooks. The core innovation of VibeVoice is its use of a 7...

Grok-2: xAI's Open Source Mixture-of-Experts Large Language Model
Grok-2 is a second-generation large language model developed by Elon Musk's xAI in 2024. A key feature of the model is its Mixture-of-Experts (MoE) architecture, which is designed to process information more efficiently. Simply put, there are multiple "experts" within the model...

Baichuan-M2: A Large Language Model for Augmented Reasoning in Healthcare
Baichuan-M2 is an open source large language model with 32 billion (32B) parameters from Baichuan Intelligence. The model focuses on the medical domain and is designed to handle real-world medical reasoning tasks. It is built on the Qwen2.5-32B model and was developed by introducing an innovative “Large ...

Genie 3: Generating virtual worlds that can be interacted with in real time
Genie 3 is a general-purpose world model released by Google DeepMind, which represents the latest advancement in using AI to simulate and create virtual environments. The core feature of this model is that it can generate a diverse and dynamic world that supports real-time interaction based solely on a textual description. Users can use this...

Seed-OSS: Open Source Large Language Model for Long Context Reasoning and Versatile Applications
Seed-OSS is a series of open source large language models developed by the Seed team at ByteDance, focusing on long context processing, reasoning capabilities and agent task optimization. The models contain 36 billion parameters and are trained with only 12 trillion tokens, with excellent performance in multiple mainstream benchmarks and support for ...

HRM: Hierarchical Reasoning Model for Complex Reasoning
HRM (Hierarchical Reasoning Model) is a model with only 27 million parameters, designed to solve complex reasoning tasks in the field of artificial intelligence. Its design is inspired by the hierarchical, multi-timescale information processing of the human brain. It does this through a high-level module (responsible for easing ...

DeepSeek-V3.1-Base: a large-scale language model for efficiently processing complex tasks
DeepSeek-V3.1-Base is an open source large language model developed by DeepSeek and released on the Hugging Face platform, designed for natural language processing tasks. It has 685 billion parameters, supports multiple data types (BF16, F8_E4M3, F32), and can...

Qwen-Image-Edit: an AI model for editing images based on textual commands
Qwen-Image-Edit is an image editing AI model developed by Alibaba's Tongyi Qianwen (Qwen) team. It is built on the 20-billion-parameter Qwen-Image model, and its core function is to allow users to modify images through simple Chinese or English text commands. This model utilizes both visual semantic understanding and...

GLM-4.5V: A multimodal dialog model capable of understanding images and videos and generating code
GLM-4.5V is a new-generation vision-language model (VLM) developed by Zhipu AI (Z.ai). The model is built on the flagship text model GLM-4.5-Air using an MoE architecture, with 106 billion total parameters, of which 12 billion are active. GLM-4.5V not only processes images and text, but also understands visual...

Qwen-Image: an AI tool for generating high-fidelity images with accurate text rendering
Qwen-Image is a 20B-parameter multimodal diffusion transformer (MMDiT) model developed by the Qwen team, specializing in high-fidelity image generation and accurate text rendering. It excels in complex text processing (especially Chinese and English) and image editing. The model supports a variety of art styles, such as realistic, anime, and high-definition posters,...

MiniMax Releases Speech 2.5: Speech Synthesis Technology Breaks Through in Multilingualism and Tone Reproduction
On August 7, MiniMax released its next-generation speech generation model, Speech 2.5, which, according to official data, improves on its predecessor Speech 02 in terms of multilingual expressiveness, tone reproduction accuracy, and the number of supported languages. In the field of Artificial Intelligence Generated Content (AIGC), the text...

KittenTTS: A Lightweight Text-to-Speech Model
KittenTTS is an open source text-to-speech (TTS) model focused on being lightweight and efficient. It takes up less than 25MB of storage, has about 15 million parameters, and runs on low-end devices without GPU support. Developed by the KittenML team, KittenTTS offers multiple...

GPT-OSS: OpenAI's Open Source Large Model for Efficient Reasoning
GPT-OSS is a family of open source language models from OpenAI, including gpt-oss-120b and gpt-oss-20b, with 117 billion and 21 billion parameters, respectively. They are released under the Apache 2.0 license, which allows developers to download, modify, and deploy them free of charge. gpt-oss...

SongGeneration: open-source AI model for generating high-quality music and lyrics
SongGeneration is a music generation model developed and open-sourced by Tencent AI Lab, focusing on generating high-quality songs, including lyrics, accompaniment and vocals. It is based on the LeVo framework, combining the language model LeLM and music codecs to support song generation in English and Chinese. The model is trained on a million-song dataset and can...