OpenBench is an open-source language model evaluation tool that is not tied to any single model vendor. Developers can use it to run standardized, reproducible performance evaluations of language models on more than 20 benchmarks covering domains such as knowledge, reasoning, coding, and mathematics.
OpenBench's core strengths are its simplicity and versatility. It provides a simple command-line interface that lets users launch evaluation tasks with just a few commands. The tool supports a wide range of mainstream model providers, such as Groq, OpenAI, Anthropic, and Google, and is also compatible with local models run through Ollama. Because it is built on top of the `inspect-ai` framework, it is extensible, allowing developers to easily add new benchmarks and evaluation metrics. This makes OpenBench a flexible and easy-to-use platform for evaluating model performance.
## Feature List
- Supports over 20 benchmarks: Built-in MMLU, GPQA, HumanEval, SimpleQA, and a variety of competition-level math assessments such as AIME and HMMT.
- Simple command-line interface (CLI): Provides simple, intuitive commands such as `bench list`, `bench describe`, and `bench eval` to manage and run evaluations (see the example after this list).
- Compatible with multiple model providers: Supports over 15 model providers, including Groq, OpenAI, Anthropic, Google, AWS Bedrock, Azure, and more.
- Support for local models: Can be integrated with Ollama to evaluate language models running locally.
- Built on a standardized framework: Built on top of the `inspect-ai` evaluation framework, which keeps evaluations consistent and reliable.
- Highly extensible: Allows developers to easily add new benchmarks and custom evaluation metrics.
- Interactive results view: Provides the `bench view` command to browse evaluation logs in an interactive user interface.
- Flexible evaluation configuration: Lets users configure the evaluation process in detail through command-line options or environment variables, such as the temperature, the maximum number of tokens, and the number of concurrent requests.
## Usage Guide
OpenBench provides a complete set of tools for standardized benchmarking of large language models (LLMs). The following sections describe in detail how to install the tool and use it to evaluate models.
### **1. Environment Preparation and Installation**
Before you can use OpenBench, you need to install `uv`, a fast Python package installer and virtual environment manager.
**Step 1: Install uv (if not already installed)**
Installing `uv` is very simple; refer to its official documentation. Once it is installed, you can start setting up the OpenBench environment.
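For reference, the commands below show typical ways to install `uv` at the time of writing; treat them as an example and check the official uv documentation for current, platform-specific instructions.

```bash
# Standalone installer for Linux/macOS (from the uv documentation)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Alternatively, uv can also be installed with pip
pip install uv
```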
**Step 2: Create and activate a virtual environment**
To keep project dependencies isolated, it is recommended to create a new virtual environment.
```bash
# Create a virtual environment named .venv
uv venv

# Activate the virtual environment (on Linux or macOS)
source .venv/bin/activate
```
**Step 3: Install OpenBench**
After activating the virtual environment, use `uv` to install OpenBench.

```bash
uv pip install openbench
```

This command automatically pulls in all required dependencies.
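To confirm that the installation succeeded, you can check that the `bench` command is available:

```bash
bench --help
```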
### **2. Configure API Keys**
OpenBench supports multiple model providers, and you need to set the corresponding API key to use their models. Keys are configured through environment variables.

```bash
# Example: set the Groq API key
export GROQ_API_KEY="your-api-key"

# Example: set the OpenAI API key
export OPENAI_API_KEY="your-api-key"

# Example: set the Anthropic API key
export ANTHROPIC_API_KEY="your-api-key"
```

You only need to set the key for the provider you plan to use.
### **3. Run an Evaluation Task**
Once configuration is complete, you can run an evaluation task with the `bench eval` command.
**Basic command format:**
`bench eval <benchmark-name> --model <model-name>`
**Quick-start example:**
Let's take the `mmlu` benchmark as an example, using Groq's `llama-3.3-70b-versatile` model and evaluating only 10 samples.
```bash
bench eval mmlu --model groq/llama-3.3-70b-versatile --limit 10
```

- `mmlu`: The name of the benchmark to run.
- `--model groq/llama-3.3-70b-versatile`: Specifies the model to evaluate.
- `--limit 10`: Evaluates only 10 samples from the dataset, so you get quick results on a first run.

After the evaluation task completes, the results are saved by default in the `./logs/` folder under the project directory.
### **4. View the Evaluation Results**
You have two ways to view the results:
**Option 1: View the log files directly**
The result logs are plain-text or JSON files in the `./logs/` directory, and you can open them directly with a text editor.
**Option 2: Use the interactive interface**
OpenBench provides a more user-friendly, interactive interface for browsing results.
```bash
bench view
```

This command starts a local service that lets you browse and analyze the results of past evaluations in your browser.
### **5. Main Commands and Common Options**
OpenBench's core functionality is exposed through the `bench` command.
- `bench --help`: Shows all available commands and global options.
- `bench list`: Lists all available benchmarks, models, and flags.
- `bench eval <benchmark>`: Runs the specified benchmark.
- `bench view`: Starts the interactive interface for viewing logs.
#### **Key options for the `eval` command**
The `eval` command supports a rich set of options to control the evaluation process, which you can set through command-line arguments or environment variables.
| Option | Environment variable | Description |
| --- | --- | --- |
| `--model` | `BENCH_MODEL` | Specifies one or more models to evaluate. |
| `--limit` | `BENCH_LIMIT` | Limits the number of evaluation samples; accepts a number or a range (e.g., `10,20`). |
| `--temperature` | `BENCH_TEMPERATURE` | Sets the model's generation temperature, which affects output randomness. |
| `--max-connections` | `BENCH_MAX_CONNECTIONS` | Sets the maximum number of parallel connections to the model API (default: 10). |
| `--logfile` | `BENCH_OUTPUT` | Specifies the path of the log file where results are saved. |
| `--sandbox` | `BENCH_SANDBOX` | Specifies the code-execution environment (e.g., `local` or `docker`) for coding benchmarks such as HumanEval. |
| `--json` | None | If set, results are output in JSON format. |
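For example, the quick-start run from earlier can be expressed through environment variables instead of command-line flags, using the variable names from the table above:

```bash
# Equivalent to: bench eval mmlu --model groq/llama-3.3-70b-versatile --limit 10
export BENCH_MODEL="groq/llama-3.3-70b-versatile"
export BENCH_LIMIT="10"
bench eval mmlu
```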
### **6. Use Different Providers or Local Models**
You can easily switch between different model providers.
```bash
# Use an OpenAI model
bench eval humaneval --model openai/o3-2025-04-16

# Use a Google model
bench eval mmlu --model google/gemini-2.5-pro

# Use a local model running via Ollama
# (make sure the Ollama service is running)
bench eval musr --model ollama/llama3.1:70b
```
### **7. Handle Hugging Face Dataset Downloads**
Some benchmarks require datasets to be downloaded from Hugging Face. If you encounter a "gated" error, the dataset requires user authentication, and you will need to set a Hugging Face access token.
```bash
export HF_TOKEN="your-huggingface-token"
```

After setting the token, re-run the `bench eval` command and the problem should be resolved.
## Application Scenarios
- Model research and development: Researchers and developers building new language models can use OpenBench to quickly test a new model's performance on multiple industry-standard benchmarks and compare it quantitatively with existing mainstream models to validate improvements.
- Model selection and procurement: Enterprises or teams choosing a language model for their business can use OpenBench to run a uniform, fair performance evaluation of candidate models from different vendors (e.g., OpenAI, Google, Anthropic) and make data-driven decisions.
- Continuous integration and regression testing: For workflows that frequently fine-tune or iterate on models, OpenBench can be integrated into a CI/CD pipeline so that a standardized set of benchmarks runs automatically whenever a model is updated, ensuring there is no unexpected regression in model performance (see the sketch after this list).
- Local model performance evaluation: For privacy-sensitive or offline scenarios, developers can deploy open-source models locally with Ollama. OpenBench can connect to the local Ollama service to fully evaluate the knowledge, reasoning, and coding capabilities of these local models.
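The following is a minimal sketch of what such a CI step could look like, assuming the provider API key is injected as a CI secret and OpenBench is already installed in the job environment; the model name, sample limit, and log path are illustrative:

```bash
# Minimal CI sketch (not an official OpenBench integration): run a small,
# fixed benchmark subset whenever the model changes, and archive the log.
set -euo pipefail

export BENCH_MODEL="groq/llama-3.3-70b-versatile"   # example model under test
export BENCH_LIMIT="50"                             # small sample for fast feedback

# Write the run's log to a predictable path so the CI job can archive it.
bench eval mmlu --logfile ./logs/ci-mmlu.json
```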
## QA
- What is the difference between OpenBench and Inspect AI?
  OpenBench is a benchmarking library built on top of the Inspect AI framework. You can think of it this way: Inspect AI provides the underlying evaluation capabilities and tooling, while OpenBench adds ready-made implementations of more than 20 mainstream benchmarks, a unified command-line tool, and utilities shared across evaluations (e.g., math scorers). OpenBench focuses on streamlining the process of running standardized benchmarks and improving the developer experience.
- Why choose OpenBench over other tools like lm-evaluation-harness or lighteval?
  Each of these tools has its own focus, but OpenBench's main strength is its clear, easy-to-understand, and easy-to-modify benchmark implementations. It reduces code duplication across benchmarks through shared components and improves the developer experience with clean command-line tools and consistent design patterns. If you need a tool that is easy to extend and maintain, with highly readable evaluation code, OpenBench is a good choice.
- How do I use the `bench` command outside of a virtual environment?
  If you want to call the `bench` command directly from any path on your system instead of activating the virtual environment each time, clone the project locally and install it in editable mode with `uv run pip install -e .`
- Running an evaluation prompts that Hugging Face requires a login; how do I fix this?
  This usually means the dataset needed for the evaluation is gated on Hugging Face. Get a Hugging Face access token and set the `HF_TOKEN` environment variable, for example `export HF_TOKEN="hf_xxxxxxxx"`, then re-run the evaluation command.