BAGEL is an open-source multimodal base model developed by the ByteDance Seed team and hosted on GitHub. It integrates text comprehension, image generation, and image editing to support cross-modal tasks. The model has 7B active parameters (out of 14B total) and is trained on large-scale interleaved multimodal data using a Mixture-of-Transformer-Experts (MoT) architecture. BAGEL performs well on multimodal comprehension and generation tasks, outperforming open-source models such as Qwen2.5-VL and InternVL-2.5, with image generation quality comparable to SD3. It supports advanced features such as free-form image editing, video sequence generation, and 3D spatial understanding, making it useful for developers and researchers exploring AI applications. The project provides detailed installation and inference guides so users can get started quickly.
Feature List
- Supports text-to-image generation, producing high-quality images from text prompts.
- Provides an image comprehension function that enables you to analyze image content and answer related questions.
- Supports free-form image editing, modifying image details through text commands.
- Generates video sequences, creating dynamic video content from text.
- Provides multimodal reasoning capabilities to fuse text, image and video data for complex tasks.
- Supports 3D spatial understanding for multi-view compositing and world navigation tasks.
- Provides evaluation scripts for visual language modeling (VLM), text-to-image (T2I), and image editing benchmarking.
- Open-sources code and model weights, allowing custom training and fine-tuning.
Usage Guide
Installation process
To use BAGEL, you need to install and configure the relevant dependencies in your local environment. The following are the detailed installation steps:
- Clone the repository
Clone the BAGEL project locally using Git:
```bash
git clone https://github.com/bytedance-seed/BAGEL.git
cd BAGEL
```
- Create a virtual environment
Create a Python 3.10 environment with Conda and activate it:
```bash
conda create -n bagel python=3.10 -y
conda activate bagel
```
- Install dependencies
Install the necessary Python libraries by running the following command in the project directory:
```bash
pip install -r requirements.txt
```
- Download model weights
BAGEL's model weights are hosted on Hugging Face. Run the following Python script to download the model:
```python
from huggingface_hub import snapshot_download

save_dir = "/path/to/save/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```
Replace `/path/to/save/BAGEL-7B-MoT` with the local path where you wish to save the model.
- Verify installation
After the installation is complete, open the project's `inference.ipynb` file and follow the guide in the notebook to run the sample code, verifying that the model loads properly.
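Before opening the notebook, it can also help to confirm that the weights actually landed on disk. The snippet below is a minimal sketch using only the Python standard library; `save_dir` is the path you chose in the download step:
```python
# Quick sanity check (standard library only) that the weight download
# completed: count the file types requested via allow_patterns above.
import os

save_dir = "/path/to/save/BAGEL-7B-MoT"  # same path used with snapshot_download

counts = {}
for name in os.listdir(save_dir):
    ext = os.path.splitext(name)[1]
    counts[ext] = counts.get(ext, 0) + 1

print(counts)  # expect non-zero counts for .json and .safetensors at least
```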
Usage
The core functions of BAGEL are invoked via Jupyter Notebook or Python scripts. Below is a detailed flow of how the main functions work:
1. Text-to-image generation
BAGEL supports generating images from text prompts. For example, after loading the model in `inference.ipynb`, enter the following code:
```python
prompt = "A beach at sunset, coconut palms swaying, waves lapping the shore"
image = model.generate_image(prompt)
image.save("output/beach_sunset.png")
```
- Procedure:
  - Make sure the model is loaded.
  - Enter a text prompt in the notebook.
  - Run the generation code; the model outputs the image and saves it to the specified path.
  - Check the quality and content of the output image against the prompt.
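The `model.generate_image(prompt)` call above uses the simplified interface adopted throughout this article; the exact loading and sampling code lives in `inference.ipynb` and may differ. Assuming that same simplified interface, the sketch below generates several candidates for one prompt, since sampling is stochastic and picking the best of a few runs often helps:
```python
# Sketch only: generate several candidates for one prompt and keep them all.
# Assumes the simplified model.generate_image(prompt) interface used in this
# article; see inference.ipynb for the actual loading and sampling code.
import os

prompt = "A beach at sunset, coconut palms swaying, waves lapping the shore"
os.makedirs("output", exist_ok=True)

for i in range(4):
    image = model.generate_image(prompt)  # each call samples a new image
    image.save(f"output/beach_sunset_{i}.png")
```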
2. Image comprehension
BAGEL can analyze images and answer related questions. For example, upload an image and ask a question:
image_path = "sample_image.jpg"
question = "图片中的主要物体是什么?"
answer = model.analyze_image(image_path, question)
print(answer)
- Procedure:
  - Prepare an image and specify its path.
  - Enter the question and run the code.
  - The model returns an answer based on the image content, e.g. "The main object in the picture is a cat".
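Since each call takes an image path and a question, one image can be probed with several questions in a row. A sketch, again assuming the article's simplified `model.analyze_image` interface:
```python
# Sketch only: ask several questions about the same image in one pass.
# Assumes the simplified model.analyze_image(path, question) interface
# used in this article.
image_path = "sample_image.jpg"
questions = [
    "What is the main object in the image?",
    "What is happening in the background?",
    "Describe the overall mood of the scene.",
]

for question in questions:
    answer = model.analyze_image(image_path, question)
    print(f"Q: {question}\nA: {answer}\n")
```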
3. Image editing
BAGEL supports editing images via text instructions. For example, to replace the background of an image with a forest:
```python
image_path = "input_image.jpg"
instruction = "Replace the background with a lush forest"
edited_image = model.edit_image(image_path, instruction)
edited_image.save("output/edited_forest.png")
```
- Procedure:
  - Upload the image to be edited.
  - Enter a specific editing instruction.
  - Run the code and check whether the output image meets the requirements.
  - Note: current image editing may lose some sharpness; this is still being optimized.
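Edits can also be chained, feeding each result back in as the next input. The sketch below assumes the article's simplified `model.edit_image` interface and saves every intermediate image to disk, which makes the sharpness loss mentioned above easy to inspect step by step:
```python
# Sketch only: apply a sequence of edits, feeding each result back in.
# Assumes the simplified model.edit_image(path, instruction) interface used
# in this article; intermediates are saved so any sharpness loss per step
# is easy to inspect.
instructions = [
    "Replace the background with a lush forest",
    "Add soft morning light from the left side",
]

current_path = "input_image.jpg"
for step, instruction in enumerate(instructions):
    edited = model.edit_image(current_path, instruction)
    current_path = f"output/edit_step_{step}.png"
    edited.save(current_path)
```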
4. Video sequence generation
BAGEL supports generating video sequences from text. Example:
```python
prompt = "A cat chasing a butterfly on the grass"
video = model.generate_video(prompt)
video.save("output/cat_chasing_butterfly.mp4")
```
- Procedure:
  - Enter the video generation prompt.
  - Run the generation code; the model outputs a short video sequence.
  - Check that the video content matches the description.
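Several clips can be produced in one pass. A sketch, assuming the article's simplified `model.generate_video` interface:
```python
# Sketch only: generate clips for several prompts in one pass.
# Assumes the simplified model.generate_video(prompt) interface used in
# this article.
prompts = {
    "cat_chasing_butterfly": "A cat chasing a butterfly on the grass",
    "rainy_street": "A rainy city street at night with neon reflections",
}

for name, prompt in prompts.items():
    video = model.generate_video(prompt)
    video.save(f"output/{name}.mp4")  # check each clip against its prompt
```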
5. Assessing model performance
BAGEL provides evaluation scripts to test the performance of the model in visual language comprehension, image generation and editing tasks. Run the evaluation:
```bash
cd EVAL
python run_benchmarks.py
```
- Procedure:
  - Enter the `EVAL` directory.
  - Execute the evaluation script to see how the model performs on standard benchmarks.
  - The results are displayed in the terminal or saved as a log file.
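To keep a copy of the results automatically, the script can be launched from Python with its output captured. A minimal sketch, assuming the `EVAL/run_benchmarks.py` entry point shown above:
```python
# Sketch only: run the evaluation script and save its terminal output as a
# log file. Assumes the EVAL/run_benchmarks.py entry point shown above.
import subprocess

result = subprocess.run(
    ["python", "run_benchmarks.py"],
    cwd="EVAL",
    capture_output=True,
    text=True,
)

with open("benchmark_results.log", "w") as f:
    f.write(result.stdout)
print(result.stdout)
```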
Notes
- Ensure hardware support: BAGEL requires GPU acceleration; an NVIDIA GPU with at least 16GB of VRAM is recommended (a quick check is sketched after this list).
- Check network connection: downloading model weights requires a stable network.
- Reference documentation: the project's `README.md` and `inference.ipynb` provide detailed code examples and parameter descriptions.
- Community support: for questions, submit an issue on the GitHub Issues page, or refer to the discussions on Hugging Face.
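For the hardware note above, a quick check of the available GPU memory before loading the model, using the standard PyTorch CUDA API:
```python
# Quick hardware check: verify a CUDA GPU is present and report its total
# memory against the recommended 16 GB.
import torch

if not torch.cuda.is_available():
    raise RuntimeError("BAGEL inference requires a CUDA-capable GPU")

total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU: {torch.cuda.get_device_name(0)}, {total_gb:.1f} GB")
if total_gb < 16:
    print("Warning: less than the recommended 16 GB of GPU memory")
```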
Application Scenarios
- Content creation
BAGEL can generate blog illustrations, social media content, or video clips. Creators enter text descriptions and quickly obtain images or short videos that fit the theme, saving design time.
- Education and research
Researchers can use BAGEL to run multimodal AI experiments, testing how text interacts with images. Students can learn about AI model development and deployment through the open-source code.
- Product prototyping
Developers can build interactive applications on top of BAGEL, such as intelligent image editing tools or text-based video generation apps, for rapid product prototyping.
- Game development
BAGEL's 3D spatial understanding and image generation capabilities can be used to generate game scenes or dynamic assets, reducing development costs.
FAQ
- What languages does BAGEL support?
BAGEL mainly supports English and Chinese text input and output. Support for other languages may be weaker due to training data limitations.
- How much computing power is needed?
A GPU with at least 16GB of VRAM is recommended; CPU-only inference is much slower and is not suitable for generation tasks.
- How can I contribute code or improve the model?
Pull requests can be submitted in the GitHub repository. Training and fine-tuning documentation will be available soon; refer to `README.md` for updates.
- What is the quality of image generation?
BAGEL's image generation quality is close to that of SD3, but may require further optimization for complex scenes or high resolutions.