M3-Agent is a multimodal agent framework developed by the ByteDance SEED team. Its core feature is long-term memory: like a human, it builds and continually updates its memory by processing real-time video (watching) and audio (listening) inputs. This memory system does not just record what happened (episodic memory); it also extracts knowledge and concepts from those events (semantic memory), such as recognizing different people and objects and the relationships between them. M3-Agent organizes this information into an entity-centered multimodal knowledge graph, which gives it a deeper and more coherent understanding of its environment. When it receives an instruction or question from the user, M3-Agent autonomously performs multiple rounds of thinking and reasoning, retrieves relevant information from its large long-term memory, and finally completes the task or gives an answer. This addresses a pain point of existing models, which struggle to process and remember long videos, and it has broad applications in robotics, personal assistants, and other fields.

Feature List

  • Multimodal input processing: receives and understands real-time video and audio streams simultaneously.
  • Long-term memory construction: converts incoming information into long-term memory, divided into two categories:
    • Episodic memory: records specific events and their original content.
    • Semantic memory: distills abstract knowledge about entities (e.g., people, objects) and the relationships between them from those events.
  • Entity-centered memory structure: memory is organized around entities to form a multimodal knowledge graph, ensuring consistency and relevance of information (see the sketch after this list).
  • Autonomous reasoning and retrieval: upon receiving an instruction, the agent autonomously performs multiple rounds of iterative thinking and retrieves the most relevant information from its memory bank to support its decisions.
  • Reinforcement learning optimization: the agent's memory retrieval and reasoning are trained with reinforcement learning to achieve a higher task success rate.
  • Leading performance: significantly higher accuracy than strong models such as Gemini-1.5-Pro and GPT-4o on multiple long-video question-answering benchmarks.
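To make the entity-centered memory structure more concrete, the following is a minimal conceptual sketch. The class and field names (Entity, MemoryGraph, add_episodic, add_semantic) are hypothetical illustrations, not M3-Agent's actual data structures; the point is only to show how episodic events and semantic facts can hang off the same entity nodes.

# Minimal conceptual sketch of an entity-centered memory graph.
# All names here are hypothetical and do not mirror M3-Agent's real code.
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: str                                      # e.g. "person_01", "object_glasses"
    modality_refs: list = field(default_factory=list)   # face crops, voice embeddings, ...

@dataclass
class MemoryGraph:
    entities: dict = field(default_factory=dict)        # entity_id -> Entity
    episodic: list = field(default_factory=list)        # (clip_id, subject, predicate, object)
    semantic: list = field(default_factory=list)        # (subject, relation, object)

    def add_episodic(self, clip_id, subj, pred, obj):
        self.episodic.append((clip_id, subj, pred, obj))

    def add_semantic(self, subj, rel, obj):
        self.semantic.append((subj, rel, obj))

# Example: the agent sees person_01 put the glasses down in clip 12, and later
# abstracts a semantic fact about where the glasses are usually kept.
graph = MemoryGraph()
graph.entities["person_01"] = Entity("person_01")
graph.entities["object_glasses"] = Entity("object_glasses")
graph.add_episodic(12, "person_01", "places", "object_glasses")
graph.add_semantic("object_glasses", "usually_located_at", "nightstand")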

Usage Guide

The operation of M3-Agent is divided into two core processes: memorization and control. The memorization process analyzes the video and builds the knowledge base, while the control process retrieves information from the knowledge base and generates answers to the user's questions.

Hardware Requirements

  • Full run (including the memorization process): requires a server with 1× A100 (80 GB VRAM) or 4× RTX 3090 (24 GB VRAM each).
  • Inference only (control process): requires a GPU with at least 16 GB of VRAM.
  • Disk space: at least 200 GB of free space for model weights and intermediate cache files.

Environment Setup

First, you need to clone the code repository and install the base environment.

# Run the setup script
bash setup.sh
# Install a specific revision of the transformers library
pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
# Install the Qwen-Omni utility library
pip install qwen-omni-utils==0.0.4

Step 1: Memorization Process

This process transforms the video content into a structured memory graph that is stored locally. If you use the officially provided M3-Bench dataset, you can skip some of the data-processing steps and directly download the officially processed intermediate files and memory graphs.

1. Video slicing
Since the model processes short video clips, the long video must first be cut into 30-second segments.

#!/bin/bash
# Path of the video to slice (relative to data/videos, without extension)
video="robot/bedroom_01"
input="data/videos/$video.mp4"
# Create the directory that will hold the clips
mkdir -p "data/clips/$video"
# Get the total duration of the video in seconds
duration=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$input")
duration_seconds=$(echo "$duration" | awk '{print int($1)}')
# Compute how many 30-second segments are needed
segments=$((duration_seconds / 30 + 1))
# Slice the video in a loop
for ((i=0; i<segments; i++)); do
    start=$((i * 30))
    output="data/clips/$video/$i.mp4"
    # Cut each 30-second segment with ffmpeg (stream copy, no re-encoding)
    ffmpeg -ss $start -i "$input" -t 30 -c copy "${output}"
done

2. Prepare the data configuration file
Create a JSONL file, for example data/data.jsonl, in which each line describes one video.

{"id": "bedroom_01", "video_path": "data/videos/robot/bedroom_01.mp4", "clip_path": "data/videos/clips/bedroom_01", "mem_path": "data/videos/memory_graphs/bedroom_01.pkl", "intermediate_path": "data/videos/intermediate_outputs/robot/bedroom_01"}

3. Generate intermediate outputs (optional)
This step uses face detection and speaker recognition tools to generate the intermediate files used for building memories. If you have already downloaded the officially processed intermediate_outputs from Hugging Face, you can skip this step.

# First download the audio embedding model and the speakerlab library into the following locations:
# m3-agent/
# ├── models/
# │   └── pretrained_eres2netv2.ckpt
# └── speakerlab/
python m3_agent/memorization_intermediate_outputs.py \
--data_file data/data.jsonl

4. Generate memory graphs
Use the M3-Agent-Memorization model to generate the final memory graph files (in .pkl format).

# First download the M3-Agent-Memorization model from Hugging Face
python m3_agent/memorization_memory_graphs.py \
--data_file data/data.jsonl
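Once a memory graph has been written, you can quickly check that the .pkl file deserializes. The snippet below is only a sanity check; the internal structure of the graph is defined by the project, so run it from the repository root so that the project's classes are importable.

import pickle

# Load a generated memory graph and report its type; this only confirms
# the file loads cleanly, without assuming anything about its structure.
mem_path = "data/videos/memory_graphs/bedroom_01.pkl"
with open(mem_path, "rb") as f:
    memory_graph = pickle.load(f)
print(type(memory_graph))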

Step 2: Control Process

Once the memory graphs have been generated, the question-answering process can begin.

1. Additional environment setup
The control process requires specific library versions.

bash setup.sh
pip install transformers==4.51.0
pip install vllm==0.8.4
pip install numpy==1.26.4

2. Question answering and evaluation
Use the M3-Agent control model (M3-Agent-Control) to generate answers; answer quality can be evaluated with GPT-4o.

# First download the M3-Agent-Control model from Hugging Face
python m3_agent/control.py \
--data_file data/annotations/robot.json

The data/annotations/robot.json file contains the questions to ask about the videos. You can modify this file to ask your own questions.
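Before editing robot.json it can help to preview its contents. The snippet below makes no assumption about the annotation schema and simply pretty-prints the file.

import json

# Preview the annotation file before editing it; the exact schema is defined
# by the M3-Agent project, so we only pretty-print the first part of it.
with open("data/annotations/robot.json") as f:
    annotations = json.load(f)
print(json.dumps(annotations, indent=2, ensure_ascii=False)[:2000])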

Visualizing the Memory Graph

You can also visualize the generated memory graphs to inspect the contents of the agent's memory.

python visualization.py \
--mem_path data/memory_graphs/robot/bedroom_01.pkl \
--clip_id 1

Application Scenarios

  1. Smart home robot
    A home service robot equipped with M3-Agent can memorize each family member's habits and the usual locations of objects by continuously observing the home environment and its members' activities. For example, when the owner asks "Where did I put my glasses?", the robot can recall where it last saw the glasses and tell the owner. It can also remember that the owner drinks coffee in the morning and proactively prepare it at that time.
  2. Personal digital assistant
    M3-Agent can act as a super-assistant that organizes and remembers all of the user's digital information, including video conferences, voice calls, web pages viewed, and more. When the user needs to find a detail discussed in a meeting a few weeks ago, they can simply ask in natural language, and the assistant can accurately pull the relevant piece of information from long-term memory.
  3. Automated content analysis
    For security applications that involve large amounts of video surveillance, M3-Agent can automatically analyze days or even months of footage to build a timeline and knowledge base of scenes, people, and activities. When investigating a specific incident, analysts no longer need to manually watch massive amounts of video; they can query the system directly, for example "retrieve all clips from the past week showing people in red clothing", and the system quickly returns all relevant footage.

FAQ

  1. What is the core difference between M3-Agent and large language models like GPT-4o?
    The core difference is that M3-Agent has a purpose-built external long-term memory system. The memory of a model like GPT-4o is essentially limited to the context window of the current conversation and is "forgotten" once the conversation ends. M3-Agent, by contrast, continuously stores the information it perceives through cameras and microphones in a structured memory bank, much like a human, and can retrieve and reason over it at any later time, achieving memory across time and tasks. (A minimal sketch of this retrieve-and-reason loop appears after this Q&A list.)
  2. How does the M3-Agent's "memory graph" work?
    A memory graph is a network-like data structure. M3-Agent recognizes key "entities" (e.g., people, objects) from video and audio and uses them as the nodes of the graph. It then records the states, behaviors, and relationships of these entities at different times and in different events, and this information forms the edges connecting the nodes. As a result, memory is not a pile of fragments but an interconnected network of knowledge, which is well suited to complex reasoning.
  3. Is the technical barrier to deploying and using M3-Agent high?
    For non-technical users, direct deployment has a certain threshold: it requires familiarity with the Linux command line, Python environments, and deep learning model configuration. Hardware requirements are also high, especially for the memory-generation phase, which needs a capable GPU. For developers and researchers, however, the project provides detailed installation and run scripts, and deployment can be completed fairly smoothly by following the official documentation.
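The multi-round retrieval and reasoning mentioned in the feature list and in Q1 can be pictured as a loop that alternates between querying memory and deciding whether enough evidence has been gathered. The sketch below is purely conceptual: the function names (retrieve, reason) and the round limit are hypothetical and do not correspond to M3-Agent's actual API, which lives in m3_agent/control.py.

# Conceptual sketch of an iterative retrieve-and-reason control loop.
# All names here (retrieve, reason, MAX_ROUNDS) are hypothetical.
MAX_ROUNDS = 5

def answer_question(question, memory_graph, retrieve, reason):
    evidence = []
    for _ in range(MAX_ROUNDS):
        # Ask memory for entries relevant to the question plus evidence so far
        new_items = retrieve(memory_graph, question, evidence)
        evidence.extend(new_items)
        # Let the model decide whether it can answer or needs another round
        answer, done = reason(question, evidence)
        if done:
            return answer
    return answer  # best effort after the round limit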