Seed-OSS is a series of open-source large language models developed by ByteDance's Seed team, focused on long-context processing, reasoning capability, and agent task optimization. The models contain 36 billion parameters, were trained on only 12 trillion tokens, perform strongly on many mainstream benchmarks, and support ultra-long context processing up to 512K tokens, making them well suited to internationalized application scenarios. Seed-OSS provides flexible reasoning budget control, letting users adjust reasoning length to their needs and improve efficiency in practical applications. Seed-OSS is released under the Apache-2.0 license and is fully open source, so developers can use and modify it freely. It is widely used in research, reasoning tasks, and multimodal scenarios, and has been deployed in more than 50 real-world applications at ByteDance.
Function List
- Ultra-long context processing: Supports a context window of 512K tokens (roughly 1,600 pages of text), suitable for long documents or complex dialogs.
- Flexible reasoning budget control: The `thinking_budget` parameter dynamically adjusts reasoning length to balance speed and depth.
- Strong reasoning: Optimized for complex tasks such as math and code generation, with excellent results on benchmarks such as AIME and LiveCodeBench.
- Internationalization: Supports multilingual tasks for developers worldwide, covering translation and understanding across many languages.
- Agent task support: Built-in tool-calling functionality; with `enable-auto-tool-choice`, tasks can be processed automatically.
- Efficient deployment: Supports multi-GPU inference and is compatible with the `bfloat16` data type to optimize inference efficiency.
- Open source and community support: Released under the Apache-2.0 license with full model weights and code, easy for developers to customize.
Usage Guide
Installation process
To use a Seed-OSS model, follow the steps below to install and configure it locally or on a server. The following example uses the Seed-OSS-36B-Instruct model and is based on the official guide provided on GitHub.
- Clone the repository:
```bash
git clone https://github.com/ByteDance-Seed/seed-oss.git
cd seed-oss
```
- Install dependencies:
Make sure Python 3.8+ and pip are installed on your system, then run:
```bash
pip3 install -r requirements.txt
pip install git+ssh://git@github.com/Fazziekey/transformers.git@seed-oss
```
- Install vLLM (recommended):
Seed-OSS supports the vLLM framework for more efficient inference. Install it with:
```bash
VLLM_USE_PRECOMPILED=1 VLLM_TEST_USE_PRECOMPILED_NIGHTLY_WHEEL=1 pip install git+ssh://git@github.com/FoolPlayer/vllm.git@seed-oss
```
- Download model weights:
Download the Seed-OSS-36B-Instruct weights from Hugging Face:
```bash
huggingface-cli download ByteDance-Seed/Seed-OSS-36B-Instruct --local-dir ./Seed-OSS-36B-Instruct
```
- Configure the runtime environment:
Ensure your system has multi-GPU hardware (e.g., NVIDIA H100). The recommended configuration is `tensor-parallel-size=8` with the `bfloat16` data type to optimize performance.
- Start the inference service:
Use vLLM to start an OpenAI-compatible API service:
```bash
python3 -m vllm.entrypoints.openai.api_server \
  --host localhost \
  --port 4321 \
  --enable-auto-tool-choice \
  --tool-call-parser seed_oss \
  --trust-remote-code \
  --model ./Seed-OSS-36B-Instruct \
  --chat-template ./Seed-OSS-36B-Instruct/chat_template.jinja \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --served-model-name seed_oss
```
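Once the service is running, it can be queried from Python with any OpenAI-compatible client. A minimal smoke-test sketch, assuming the `openai` package is installed and the server above is listening on port 4321 (the placeholder API key is arbitrary):

```python
# Minimal smoke test for the OpenAI-compatible vLLM endpoint started above.
# Assumption: `pip install openai` and the server is running on localhost:4321.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4321/v1",
    api_key="EMPTY",  # vLLM does not validate the key by default
)

response = client.chat.completions.create(
    model="seed_oss",  # must match --served-model-name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```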
Usage
Seed-OSS supports several usage patterns for different scenarios. The main workflows are detailed below.
1. Basic dialog and reasoning
Use a Python script to interact with the model. The example below generates a cooking tutorial:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ByteDance-Seed/Seed-OSS-36B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "How to make pasta?"}]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    thinking_budget=512,  # cap on reasoning tokens before the final answer
)

outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)
output_text = tokenizer.decode(outputs[0])
print(output_text)
```
- Key parameters:
  - `thinking_budget=512`: Controls reasoning depth; larger values allow deeper reasoning, suited to complex tasks.
  - `max_new_tokens=2048`: Sets the maximum number of tokens to generate, which bounds output length.
2. Long-context processing
Seed-OSS supports 512K-token contexts, suitable for processing long documents or multi-turn conversations. For example, to analyze a long report:

- Pass the document text as the `messages` input, in the format `[{"role": "user", "content": "<long document content>"}]`.
- Set a high `thinking_budget` (e.g., 1024) to ensure deep reasoning.
- Reuse the script above to generate summaries or answer questions, as in the sketch below.
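A minimal sketch that reuses the setup from section 1 to summarize a long report (the file path and prompt are illustrative):

```python
# Sketch: summarizing a long document with the section-1 setup.
# Assumption: `tokenizer` and `model` are already loaded as in section 1;
# "report.txt" is a hypothetical file path.
with open("report.txt", "r", encoding="utf-8") as f:
    document = f.read()  # the 512K-token window accommodates very long inputs

messages = [{
    "role": "user",
    "content": f"Summarize the key findings of this report:\n\n{document}",
}]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    thinking_budget=1024,  # higher budget for deeper reasoning over long input
)
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)
print(tokenizer.decode(outputs[0]))
```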
3. Agent tasks and tool calls
Seed-OSS supports automated tool invocation via `enable-auto-tool-choice`. After configuring the API service as above, you can invoke the model via an HTTP request:
```bash
curl http://localhost:4321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "seed_oss",
    "messages": [{"role": "user", "content": "Calculate 2+2"}]
  }'
```
- The model automatically selects an appropriate tool (e.g., a math calculator) and returns the result.
- Make sure `--tool-call-parser seed_oss` is enabled so that tool calls are parsed correctly. A Python variant with an explicit tool definition is sketched below.
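The endpoint also accepts tool definitions in the standard OpenAI format. A hedged sketch, where the `calculate` tool is a hypothetical illustration rather than one shipped with the model:

```python
# Sketch: passing a tool definition to the tool-call-enabled endpoint.
# Assumption: the `calculate` tool below is hypothetical; any OpenAI-format
# function definition works the same way.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4321/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "e.g. '2+2'"},
            },
            "required": ["expression"],
        },
    },
}]

response = client.chat.completions.create(
    model="seed_oss",
    messages=[{"role": "user", "content": "Calculate 2+2"}],
    tools=tools,
)

# With --enable-auto-tool-choice, the model may answer directly or emit a tool call.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    print(message.content)
```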
4. Reasoning budget optimization
Users can tune the `thinking_budget` parameter to trade off inference cost against quality:

- Simple tasks (e.g., translation): set `thinking_budget=128`.
- Complex tasks (e.g., mathematical reasoning): set `thinking_budget=1024`.
Example:
```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    thinking_budget=1024,
)
```
5. Deployment optimization
- Multi-GPU inference: Allocate GPU resources via the `tensor-parallel-size` parameter; for example, `tensor-parallel-size=8` suits 8 GPUs.
- Data type: Use `bfloat16` to reduce GPU memory footprint in large-scale deployments.
- Generation configuration: `temperature=1.1` and `top_p=0.95` are recommended for diverse output; for specific tasks (e.g., TauBench), adjust to `temperature=1` and `top_p=0.7`. These settings are combined with `bfloat16` loading in the sketch below.
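A sketch putting these knobs together for the local transformers path, using `bfloat16` loading plus the recommended sampling settings from the list above (the prompt is illustrative):

```python
# Sketch: bfloat16 loading plus the recommended sampling configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ByteDance-Seed/Seed-OSS-36B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # halves memory versus float32
)

messages = [{"role": "user", "content": "Write a haiku about autumn."}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=512,
    do_sample=True,
    temperature=1.1,  # recommended default for diverse output
    top_p=0.95,       # switch to temperature=1, top_p=0.7 for tasks like TauBench
)
print(tokenizer.decode(outputs[0]))
```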
Notes
- Hardware requirements: At least one NVIDIA H100-80G GPU is recommended; four GPUs support heavier workloads.
- Model selection: Seed-OSS ships Base and Instruct versions; Instruct is better suited to interactive tasks, Base to research and fine-tuning.
- Community support: Contribute by submitting issues or pull requests on GitHub.
Application Scenarios
- Academic research
  - Scenario: Researchers can use Seed-OSS for long-document analysis, data extraction, or complex reasoning tasks, such as analyzing academic papers or generating summaries of research reports.
- Multilingual applications
  - Scenario: Developers can leverage the model's multilingual support to build internationalized chatbots or translation tools covering many languages.
- Automation agents
  - Scenario: Organizations can deploy Seed-OSS as an intelligent agent for customer service, automated task scheduling, or data analysis.
- Code generation
  - Scenario: Programmers can use the model to generate code snippets or debug complex algorithms, using the 512K context to work with large codebases.
- Educational support
  - Scenario: Educational institutions can use the models to generate instructional materials, answer student questions, or provide personalized study guides.
FAQ
- What languages does Seed-OSS support?
  - The model is optimized for international scenarios and supports multiple languages, including English, Chinese, and Spanish; see the FLORES-200 benchmark for detailed results.
- How do I adjust the reasoning budget?
  - Set the `thinking_budget` parameter in the generation script, from 128 (simple tasks) up to 1024 (complex tasks), according to task requirements.
- How much GPU memory is needed to run the model?
  - A single H100-80G GPU supports basic inference; four GPUs handle higher-load tasks. Using `bfloat16` reduces memory requirements.
- How do I get involved in model development?
  - Submit code or report issues via the GitHub repository (https://github.com/ByteDance-Seed/seed-oss), under the Apache-2.0 license.