Verifiers is a library of modular components for creating Reinforcement Learning (RL) environments and training Large Language Model (LLM) agents. The goal of the project is to provide a set of reliable tools that let developers easily build, train, and evaluate LLM agents. Verifiers includes an asynchronous GRPO (Group Relative Policy Optimization) trainer built on the transformers Trainer implementation, and is supported by the prime-rl project for large-scale FSDP (Fully Sharded Data Parallel) training. Beyond reinforcement learning training, Verifiers can also be used directly to build LLM evaluations, create synthetic data pipelines, and implement agent harnesses. The project aims to be a reliable toolkit that minimizes the "forked codebase proliferation" problem common in the reinforcement learning infrastructure ecosystem and gives developers a stable base to build on.
Feature List
- Modular environment components: a modular set of building blocks for constructing reinforcement learning environments, making it easy to create and customize them.
- Multiple environment types:
  - `SingleTurnEnv`: for tasks that require only a single model response per prompt.
  - `ToolEnv`: for building agent loops that use the model's native tool / function-calling capabilities.
  - `MultiTurnEnv`: an interface for writing custom environment interaction protocols for multi-turn dialogues or interactive tasks.
- Built-in trainer: includes a `GRPOTrainer` that uses `vLLM` for inference and supports GRPO-style reinforcement learning training via Accelerate/DeepSpeed.
- Command-line tools: practical CLI commands that streamline the workflow:
  - `vf-init`: initialize a new environment module template.
  - `vf-install`: install an environment module into the current project.
  - `vf-eval`: quickly evaluate an environment using API models.
- Integration and compatibility: integrates easily with any reinforcement learning framework that exposes an OpenAI-compatible inference client, and works natively with `prime-rl` for more efficient, larger-scale training.
- Flexible rewards: the `Rubric` class encapsulates one or more reward functions, letting you define complex evaluation criteria for model-generated completions.
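To give a flavor of how these pieces fit together, the sketch below wires a plain Python function into a tool-calling environment. It is a minimal illustration, not taken from the project's documentation: the exact `ToolEnv` constructor parameters shown here (`tools`, `max_turns`) and the expected dataset columns are assumptions and may differ from the current version of the library.

```python
import verifiers as vf
from datasets import Dataset


def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result as text."""
    # Toy implementation for illustration only; never eval untrusted input.
    return str(eval(expression))


# A tiny dataset with the `prompt` column described in the "Core elements" section below.
dataset = Dataset.from_dict({
    "prompt": ["What is 17 * 23?"],
    "answer": ["391"],
})

# Assumed constructor shape: a dataset plus a list of callables exposed as tools.
vf_env = vf.ToolEnv(
    dataset=dataset,
    tools=[calculator],
    max_turns=3,
)
```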
Usage
It is recommended to use the Verifiers library together with the `uv` package manager in your project.
1. Installation
First, you need to create a new virtual environment and activate it.
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Initialize a new project
uv init
# Activate the virtual environment
source .venv/bin/activate
```
Next, install Verifiers according to your needs:
- Local development and evaluation (CPU): if you only use API models for development and evaluation, installing the core library is sufficient.

  ```bash
  # Install the core library
  uv add verifiers
  # If you need Jupyter and test support
  uv add 'verifiers[dev]'
  ```
- GPU training: if you plan to use `vf.GRPOTrainer` to train models on GPUs, install the version with all dependencies and additionally install `flash-attn`:

  ```bash
  uv add 'verifiers[all]' && uv pip install flash-attn --no-build-isolation
  ```
- Latest development version: you can also install directly from the `main` branch:

  ```bash
  uv add verifiers @ git+https://github.com/willccbb/verifiers.git
  ```
- Installation from source (core library development): if you need to modify the Verifiers core library itself, install it from source:

  ```bash
  git clone https://github.com/willccbb/verifiers.git
  cd verifiers
  uv sync --all-extras && uv pip install flash-attn --no-build-isolation
  uv run pre-commit install
  ```
2. Creating and managing environments
Verifiers treats each reinforcement learning environment as an installable Python module.
- Initialize a new environment: use the `vf-init` command to create a new environment template.

  ```bash
  # Create an environment named my-new-env
  vf-init my-new-env
  ```

  This creates an `environments/my-new-env` directory containing a `pyproject.toml` and the basic structure of an environment template.

- Install the environment: once created, use `vf-install` to install it into your Python environment so it can be imported and used.

  ```bash
  # Install the local environment
  vf-install my-new-env
  # You can also install example environments directly from the official verifiers repository
  vf-install vf-math-python --from-repo
  ```
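For orientation, here is a rough sketch of the shape an environment module takes: a `load_environment` entry point that builds and returns an environment object. The file layout and the exact constructor arguments (`vf.SingleTurnEnv`, `vf.Rubric(funcs=...)`) are assumptions based on the descriptions in this article; the template actually generated by `vf-init` may differ.

```python
# environments/my-new-env/my_new_env.py (illustrative layout)
import verifiers as vf
from datasets import Dataset


def load_environment(**kwargs):
    """Entry point that vf.load_environment("my-new-env", ...) is expected to resolve to."""
    # Toy dataset with the required `prompt` column and a reference answer.
    dataset = Dataset.from_dict({
        "prompt": ["What is the capital of France?"],
        "answer": ["Paris"],
    })

    def exact_match(completion, answer, **_):
        # Reward 1.0 when the reference answer appears in the completion.
        return 1.0 if str(answer).lower() in str(completion).lower() else 0.0

    rubric = vf.Rubric(funcs=[exact_match])
    return vf.SingleTurnEnv(dataset=dataset, rubric=rubric, **kwargs)
```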
3. Using an environment
After installing an environment, you can load it with the `vf.load_environment` function and use it for evaluation or training.

- Load an environment:

  ```python
  import verifiers as vf

  # Load an installed environment, passing any required arguments
  vf_env = vf.load_environment("my-new-env", **env_args)
  ```

- Quick evaluation: use the `vf-eval` command to quickly test your environment. By default it uses the `gpt-4.1-mini` model and runs 3 rollouts for each of 5 prompts.

  ```bash
  # Evaluate the environment named my-new-env
  vf-eval my-new-env
  ```
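If you prefer to stay in Python rather than use the CLI, something along the following lines should work against any OpenAI-compatible endpoint. The `evaluate` call and its keyword arguments are an assumption about the environment API, modeled on what `vf-eval` does; check the library documentation for the actual method and signature.

```python
import verifiers as vf
from openai import OpenAI

# Any OpenAI-compatible client works (set base_url for self-hosted servers).
client = OpenAI()

vf_env = vf.load_environment("my-new-env")

# Assumed evaluation API: run a few rollouts per prompt and collect the results.
results = vf_env.evaluate(
    client=client,
    model="gpt-4.1-mini",
    num_examples=5,
    rollouts_per_example=3,
)
print(results)
```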
4. Core elements of the environment
A Verifiers environment consists of the following main components:
- Datasets: a Hugging Face dataset that must contain a `prompt` column as input.
- Rollout logic: how the model interacts with the environment, for example via the `env_response` and `is_completed` methods defined on `MultiTurnEnv` (see the sketch after this list).
- Rubrics: encapsulate one or more reward functions that score the model's outputs.
- Parsers: an optional component that encapsulates reusable parsing logic.
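To make the rollout-logic component concrete, here is a toy custom environment. The method names `env_response` and `is_completed` come from the list above; their exact signatures and return types are assumptions and may need adjusting to the current `MultiTurnEnv` interface.

```python
import verifiers as vf


class GuessingGameEnv(vf.MultiTurnEnv):
    """Toy protocol: keep prompting the model until it says 'stop'."""

    def is_completed(self, messages, state, **kwargs) -> bool:
        # Finish the rollout once the last message contains "stop".
        last = messages[-1]["content"] if messages else ""
        return "stop" in str(last).lower()

    def env_response(self, messages, state, **kwargs):
        # The environment's reply to the model, plus the (possibly updated) state.
        reply = [{"role": "user", "content": "Keep going, or say 'stop' to finish."}]
        return reply, state
```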
5. Training models
Verifiers offers two main types of training:
- Using the built-in `GRPOTrainer`:
  This trainer is suited to efficiently training dense transformer models on 2-16 GPUs (a sketch of the training script itself follows this list).

  ```bash
  # Step 1: start the vLLM inference server (shell 0)
  # Here 7 GPUs are used for data parallelism
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 vf-vllm --model your-model-name \
      --data-parallel-size 7 --enforce-eager --disable-log-requests

  # Step 2: launch the training script (shell 1)
  # Use the remaining GPU for training
  CUDA_VISIBLE_DEVICES=7 accelerate launch --num-processes 1 \
      --config-file configs/zero3.yaml examples/grpo/train_script.py --size 1.7B
  ```
- Using `prime-rl` (recommended):
  `prime-rl` is an external project that natively supports environments created with Verifiers and provides better performance and scalability through FSDP. It also has a more mature configuration and user experience.

  ```toml
  # Specify the environment in the prime-rl configuration file
  # orch.toml
  [environment]
  id = "your-env-name"
  ```

  ```bash
  # Launch prime-rl training
  uv run rl \
      --trainer @ configs/your_exp/train.toml \
      --orchestrator @ configs/your_exp/orch.toml \
      --inference @ configs/your_exp/infer.toml
  ```
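The training script referenced in step 2 above roughly amounts to: load an environment, load a model, hand both to `GRPOTrainer`, and call `train()`. The helper names used below (`vf.get_model_and_tokenizer`, `vf.grpo_defaults`) follow the project's examples as best recalled here; treat them as assumptions and adapt them to the actual API.

```python
# Sketch of a GRPO training script (cf. examples/grpo/train_script.py).
import verifiers as vf

# The environment the policy will be trained against.
vf_env = vf.load_environment("my-new-env")

# Load the policy model and tokenizer (helper name assumed).
model, tokenizer = vf.get_model_and_tokenizer("your-model-name")

trainer = vf.GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    env=vf_env,
    args=vf.grpo_defaults(run_name="my-new-env-grpo"),  # default GRPO config (assumed helper)
)
trainer.train()
```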
Application Scenarios
- Training task-specific agents
  Using `ToolEnv` or `MultiTurnEnv`, developers can create complex interactive environments and train LLM agents to use external tools (e.g., calculators, search engines) or to accomplish specific tasks (e.g., booking flights, customer support) across multi-turn conversations.
- Building automated evaluation pipelines
  `SingleTurnEnv` can be used to build automated evaluation pipelines. By defining an environment that contains reference answers and evaluation criteria (`Rubric`), you can quantitatively compare the performance of different models, e.g., evaluating the correctness of code generation or the quality of text summaries.
- Generating high-quality synthetic data
  Large amounts of model-environment interaction data can be generated through rollouts. This data can be saved as Hugging Face datasets and used for subsequent supervised fine-tuning (SFT) or other training, providing an efficient pipeline for synthetic data generation.
- Academic research and algorithm validation
  Verifiers provides a modular, reproducible experimentation platform for reinforcement learning researchers. Researchers can easily implement new interaction protocols, reward functions, or training algorithms and validate their effectiveness in a standardized environment.
QA
- What is the relationship between Verifiers and prime-rl?
  `prime-rl` is a standalone training framework that natively supports environments created with Verifiers. Verifiers focuses on providing components for building RL environments, while `prime-rl` focuses on providing a more powerful, better-performing, and more scalable FSDP (Fully Sharded Data Parallel) training solution. For large-scale training, the official recommendation is to use `prime-rl`.
- How do I define a reward function for my environment?
  You define one or more reward functions through a `vf.Rubric` object. Each function receives `prompt`, `completion`, and other parameters and returns a floating-point number as the reward value. You can also assign different weights to different reward functions (see the sketch at the end of this section).
- Do I need to implement the model's interaction logic myself?
  Not necessarily. For single-turn Q&A and standard tool-calling scenarios, you can use `SingleTurnEnv` and `ToolEnv` directly. You only need to inherit from `MultiTurnEnv` and override `is_completed` and `env_response` if your application requires a very specific, non-standard interaction flow.
- What should I do if I encounter NCCL-related errors during training?
  According to the official documentation, vLLM may hang on inter-GPU communication when synchronizing weights. You can try setting `NCCL_P2P_DISABLE=1` to fix the problem. If the problem persists, try setting `NCCL_CUMEM_ENABLE=1` or open an issue on the project.
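As a companion to the reward-function question above, a rubric with two weighted reward functions might look like the sketch below. The parameter names `funcs` and `weights` and the exact reward-function signature shown here are assumptions based on the description in this article.

```python
import verifiers as vf


def correctness(prompt, completion, answer, **kwargs) -> float:
    # 1.0 if the reference answer appears in the completion, else 0.0.
    return 1.0 if str(answer).lower() in str(completion).lower() else 0.0


def brevity(prompt, completion, **kwargs) -> float:
    # Small bonus for keeping the completion reasonably short.
    return 1.0 if len(str(completion)) < 200 else 0.0


# Weighted combination: correctness dominates, brevity acts as a light tie-breaker.
rubric = vf.Rubric(funcs=[correctness, brevity], weights=[1.0, 0.2])
```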