OpenBench is an open-source language model evaluation tool that is not tied to any single model vendor. Developers can use it to run standardized, reproducible performance evaluations of language models on more than 20 benchmarks covering domains such as knowledge, reasoning, coding, and mathematics.
OpenBench's core strengths are its simplicity and versatility. It provides a simple command-line interface that lets users launch evaluation tasks with just a few commands. The tool supports a wide range of mainstream model providers, such as Groq, OpenAI, Anthropic, and Google, and is also compatible with local models run through Ollama. Because it is built on top of the `inspect-ai` framework, it is extensible, allowing developers to easily add new benchmarks and evaluation metrics. This makes OpenBench a flexible and easy-to-use platform for evaluating model performance.
## Feature List
- Supports over 20 benchmarks: Built-in MMLU, GPQA, HumanEval, SimpleQA, and a variety of competition-level math assessments such as AIME and HMMT.
- Simple command-line interface (CLI): Provides simple, intuitive commands such as `bench list`, `bench describe`, and `bench eval` to manage and run evaluations (see the example after this list).
- Compatible with multiple model providers: Supports over 15 model providers, including Groq, OpenAI, Anthropic, Google, AWS Bedrock, Azure, and more.
- Support for local models: Can be integrated with Ollama to evaluate language models running locally.
- Built on a standardized framework: Built on top of the `inspect-ai` evaluation framework, which keeps evaluations consistent and reliable.
- Highly extensible: Allows developers to easily add new benchmarks and custom evaluation metrics.
- Interactive results view: Provides the `bench view` command to browse evaluation logs in an interactive user interface.
- Flexible evaluation configuration: Lets users configure the evaluation process in detail through command-line options or environment variables, such as the temperature, the maximum number of tokens, and the number of concurrent requests.
## Usage Guide
OpenBench provides a complete set of tools for standardized benchmarking of large language models (LLMs). The following sections describe in detail how to install the tool and use it to evaluate models.
### **1. Environment Preparation and Installation**
Before you can use OpenBench, you need to install `uv`, a fast Python package installer and virtual environment manager.
**Step 1: Install uv (if not already installed)**
Installing `uv` is very simple; refer to its official documentation. Once it is installed, you can start setting up the OpenBench environment.
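For reference, the commands below show typical ways to install `uv` at the time of writing; treat them as an example and check the official uv documentation for current, platform-specific instructions.

```bash
# Standalone installer for Linux/macOS (from the uv documentation)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Alternatively, uv can also be installed with pip
pip install uv
```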
**Step 2: Create and activate a virtual environment**
To keep project dependencies isolated, it is recommended to create a new virtual environment.
```bash
# Create a virtual environment named .venv
uv venv

# Activate the virtual environment (on Linux or macOS)
source .venv/bin/activate
```
**Step 3: Install OpenBench**
After activating the virtual environment, use `uv` to install OpenBench.

```bash
uv pip install openbench
```

This command automatically pulls in all required dependencies.
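To confirm that the installation succeeded, you can check that the `bench` command is available:

```bash
bench --help
```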
### **2. Configure API Keys**
OpenBench supports multiple model providers, and you need to set the corresponding API key to use their models. Keys are configured through environment variables.

```bash
# Example: set the Groq API key
export GROQ_API_KEY="your-api-key"

# Example: set the OpenAI API key
export OPENAI_API_KEY="your-api-key"

# Example: set the Anthropic API key
export ANTHROPIC_API_KEY="your-api-key"
```

You only need to set the key for the provider you plan to use.
### **3. Run an Evaluation Task**
Once configuration is complete, you can run an evaluation task with the `bench eval` command.
**Basic command format:**
`bench eval <benchmark-name> --model <model-name>`
**Quick-start example:**
Let's take the `mmlu` benchmark as an example, using Groq's `llama-3.3-70b-versatile` model and evaluating only 10 samples.
```bash
bench eval mmlu --model groq/llama-3.3-70b-versatile --limit 10
```

- `mmlu`: The name of the benchmark to run.
- `--model groq/llama-3.3-70b-versatile`: Specifies the model to evaluate.
- `--limit 10`: Evaluates only 10 samples from the dataset, so you get quick results on a first run.

After the evaluation task completes, the results are saved by default in the `./logs/` folder under the project directory.
### **4. View the Evaluation Results**
You have two ways to view the results:
**Option 1: View the log files directly**
The result logs are plain-text or JSON files in the `./logs/` directory, and you can open them directly with a text editor.
**Option 2: Use the interactive interface**
OpenBench provides a more user-friendly, interactive interface for browsing results.
```bash
bench view
```

This command starts a local service that lets you browse and analyze the results of past evaluations in your browser.
### **5. Main Commands and Common Options**
OpenBench's core functionality is exposed through the `bench` command.
- `bench --help`: Shows all available commands and global options.
- `bench list`: Lists all available benchmarks, models, and flags.
- `bench eval <benchmark>`: Runs the specified benchmark.
- `bench view`: Starts the interactive interface for viewing logs.
#### **Key options for the `eval` command**
The `eval` command supports a rich set of options to control the evaluation process, which you can set through command-line arguments or environment variables.
| Option | Environment variable | Description |
| --- | --- | --- |
| `--model` | `BENCH_MODEL` | Specifies one or more models to evaluate. |
| `--limit` | `BENCH_LIMIT` | Limits the number of evaluation samples; accepts a number or a range (e.g., `10,20`). |
| `--temperature` | `BENCH_TEMPERATURE` | Sets the model's generation temperature, which affects output randomness. |
| `--max-connections` | `BENCH_MAX_CONNECTIONS` | Sets the maximum number of parallel connections to the model API (default: 10). |
| `--logfile` | `BENCH_OUTPUT` | Specifies the path of the log file where results are saved. |
| `--sandbox` | `BENCH_SANDBOX` | Specifies the code-execution environment (e.g., `local` or `docker`) for coding benchmarks such as HumanEval. |
| `--json` | None | If set, results are output in JSON format. |
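For example, the quick-start run from earlier can be expressed through environment variables instead of command-line flags, using the variable names from the table above:

```bash
# Equivalent to: bench eval mmlu --model groq/llama-3.3-70b-versatile --limit 10
export BENCH_MODEL="groq/llama-3.3-70b-versatile"
export BENCH_LIMIT="10"
bench eval mmlu
```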
### **6. Use Different Providers or Local Models**
You can easily switch between different model providers.
```bash
# Use an OpenAI model
bench eval humaneval --model openai/o3-2025-04-16

# Use a Google model
bench eval mmlu --model google/gemini-2.5-pro

# Use a local model running via Ollama
# (make sure the Ollama service is running)
bench eval musr --model ollama/llama3.1:70b
```
### **7. Handle Hugging Face Dataset Downloads**
Some benchmarks require datasets to be downloaded from Hugging Face. If you encounter a "gated" error, the dataset requires user authentication, and you will need to set a Hugging Face access token.
```bash
export HF_TOKEN="your-huggingface-token"
```

After setting the token, re-run the `bench eval` command and the problem should be resolved.
## Application Scenarios
- Model research and development: Researchers and developers building new language models can use OpenBench to quickly test a new model's performance on multiple industry-standard benchmarks and compare it quantitatively with existing mainstream models to validate improvements.
- Model selection and procurement: Enterprises or teams choosing a language model for their business can use OpenBench to run a uniform, fair performance evaluation of candidate models from different vendors (e.g., OpenAI, Google, Anthropic) and make data-driven decisions.
- Continuous integration and regression testing: For workflows that frequently fine-tune or iterate on models, OpenBench can be integrated into a CI/CD pipeline so that a standardized set of benchmarks runs automatically whenever a model is updated, ensuring there is no unexpected regression in model performance (see the sketch after this list).
- Local model performance evaluation: For privacy-sensitive or offline scenarios, developers can deploy open-source models locally with Ollama. OpenBench can connect to the local Ollama service to fully evaluate the knowledge, reasoning, and coding capabilities of these local models.
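The following is a minimal sketch of what such a CI step could look like, assuming the provider API key is injected as a CI secret and OpenBench is already installed in the job environment; the model name, sample limit, and log path are illustrative:

```bash
# Minimal CI sketch (not an official OpenBench integration): run a small,
# fixed benchmark subset whenever the model changes, and archive the log.
set -euo pipefail

export BENCH_MODEL="groq/llama-3.3-70b-versatile"   # example model under test
export BENCH_LIMIT="50"                             # small sample for fast feedback

# Write the run's log to a predictable path so the CI job can archive it.
bench eval mmlu --logfile ./logs/ci-mmlu.json
```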
## QA
- What is the difference between OpenBench and Inspect AI?
  OpenBench is a benchmarking library built on top of the Inspect AI framework. You can think of it this way: Inspect AI provides the underlying evaluation capabilities and tooling, while OpenBench adds ready-made implementations of more than 20 mainstream benchmarks, a unified command-line tool, and utilities shared across evaluations (e.g., math scorers). OpenBench focuses on streamlining the process of running standardized benchmarks and improving the developer experience.
- Why choose OpenBench over other tools like lm-evaluation-harness or lighteval?
  Each of these tools has its own focus, but OpenBench's main strength is its clear, easy-to-understand, and easy-to-modify benchmark implementations. It reduces code duplication across benchmarks through shared components and improves the developer experience with clean command-line tools and consistent design patterns. If you need a tool that is easy to extend and maintain, with highly readable evaluation code, OpenBench is a good choice.
- How do I use the `bench` command outside of a virtual environment?
  If you want to call the `bench` command directly from any path on your system instead of activating the virtual environment each time, clone the project locally and install it in editable mode with `uv run pip install -e .`
- Running an evaluation prompts that Hugging Face requires a login; how do I fix this?
  This usually means the dataset needed for the evaluation is gated on Hugging Face. Get a Hugging Face access token and set the `HF_TOKEN` environment variable, for example `export HF_TOKEN="hf_xxxxxxxx"`, then re-run the evaluation command.