GEPA (Genetic-Pareto) is a framework for optimizing the textual components of AI systems, such as prompts, code snippets, or model configuration files. It uses an approach called Reflective Text Evolution, in which a large language model (LLM) analyzes and reflects on the system's behavior. Specifically, GEPA examines the execution and evaluation records generated while the system runs and uses this information to propose targeted improvements. The framework combines iterative mutation, reflection, and Pareto-optimal selection to evolve a more performant version of the system within a limited evaluation budget. GEPA not only optimizes individual components but can also co-evolve multiple components of a modular system, yielding significant performance gains in specific domains. According to its research paper, "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning", GEPA is more efficient than traditional reinforcement learning methods, improving performance while requiring significantly fewer samples.
Feature List
- Reflective Text Evolution: analyzes system execution trajectories (e.g., reasoning processes, tool invocations, and outputs) with large language models (LLMs) to diagnose problems and propose improvements in natural language.
- Multi-objective optimization: using a Pareto-optimal selection mechanism, GEPA can optimize several objectives at once (e.g., shortening prompt length while maintaining accuracy) and retains a diverse set of strong candidates.
- High sample efficiency: compared with traditional reinforcement learning methods that require thousands of attempts, GEPA achieves significant performance gains with very few samples ("rollouts"), reducing the number required by up to 35x.
- Wide applicability: optimizes not only prompts but also code, instructions, and complete AI programs, such as the signatures, modules, and control flow of a DSPy program.
- Flexible adapter interface: by implementing the GEPAAdapter interface, users can integrate GEPA into any system that contains text components. Integration centers on defining two methods: Evaluate and ExtractTracesforReflection (extracting traces for reflection).
- Integration with the DSPy framework: GEPA is integrated directly into DSPy, so users can call it through the dspy.GEPA API; this is the easiest and most powerful way to use GEPA.
- Support for complex system optimization: GEPA can optimize complex AI systems such as retrieval-augmented generation (RAG) pipelines, multi-turn conversational agents, and agents operating in external environments (e.g., terminal-bench).
Usage Guide
GEPA is a powerful framework designed to automatically optimize the textual components of AI systems, such as prompts or code, by mimicking the human "reflect and improve" learning process. The following sections describe how to use GEPA in detail.
Installation
GEPA can be easily installed with pip, Python's package manager.
Stable version installation:
Open a terminal or command line tool and enter the following command:
pip install gepa
Latest development version:
If you wish to experience the latest features, you can install them directly from the GitHub repository:
pip install git+https://github.com/gepa-ai/gepa.git
Core concepts
Effective use of GEPA requires an understanding of its two core concepts:
- Reflection: the core mechanism of GEPA. Instead of looking only at whether a task ultimately succeeded (a single score), GEPA has a powerful language model (the "reflection model") read a trace of the task's entire execution. This record contains all of the AI's "thinking" steps, intermediate outputs, errors encountered, and so on. By reading these detailed records, the reflection model can make specific, targeted suggestions for improvement in natural language.
- Evolution: GEPA draws on ideas from genetic algorithms. It starts with an initial prompt (the "seed") and, through reflection, generates a number of new, potentially better versions ("mutations"). It then tests these new versions and keeps the best-performing batch ("selection"). Repeating this process, with each generation building on the last, eventually evolves a high-performing prompt.
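The reflect-mutate-select loop described above can be sketched in a few lines of Python. This is a toy illustration only, not GEPA's implementation: `evaluate` and `reflect_and_mutate` are hypothetical stand-ins for the real system rollout and the reflective LLM call.

```python
def evaluate(prompt):
    # Stand-in for running the task and scoring the result; here, prompts
    # containing "step" score higher (a toy heuristic, not a real metric).
    return prompt.count("step") + 0.01 * len(prompt)

def reflect_and_mutate(prompt):
    # Stand-in for the reflective LLM: it would read execution traces and
    # propose a targeted rewrite. Here we simply append generic advice.
    return prompt + " Think step by step."

def evolve(seed_prompt, generations=3, pool_size=4):
    population = [seed_prompt]
    for _ in range(generations):
        # Mutation: each candidate spawns a reflected variant
        offspring = [reflect_and_mutate(p) for p in population]
        # Selection: keep only the best-scoring candidates
        population = sorted(set(population + offspring),
                            key=evaluate, reverse=True)[:pool_size]
    return population[0]

best = evolve("Solve the problem.")
```

Each generation, the surviving candidates carry forward the changes that improved their scores, which is the "evolution" half of GEPA's name.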
The easiest way to use GEPA: through the DSPy framework
For most users, combining GEPA with the DSPy framework is the recommended approach. DSPy helps you build modular language-model programs, and GEPA acts as an optimizer that improves their performance.
Here is a simple example of optimizing a prompt for math problem solving:
Step 1: Prepare the environment and data
Make sure you have installed gepa and dspy-ai, and have set your OpenAI API key.
import gepa
import dspy
# Set up the language model that performs the task
task_lm = dspy.OpenAI(model='gpt-4.1-mini', max_tokens=1000)
# Use a stronger model for reflection
reflection_lm = dspy.OpenAI(model='gpt-5', max_tokens=3500)
dspy.settings.configure(lm=task_lm)
# Load the dataset (here, the built-in AIME math competition examples)
trainset, valset, _ = gepa.examples.aime.init_dataset()
Step 2: Define the initial program (or prompt)
In DSPy, you can define a simple Signature that describes the task's inputs and outputs, then implement it with a Module.
class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("problem -> reasoning, answer")

    def forward(self, problem):
        return self.prog(problem=problem)
Step 3: Define the evaluation metric
You need to tell GEPA how to judge whether an output is good. Here we define a simple metric that checks whether the model produced the correct answer.
def aime_metric(gold, pred, trace=None):
    # gold is the reference answer; pred is the model's predicted output
    return gold.answer == pred.answer
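The metric simply compares the `answer` field of the gold example and the prediction. A quick sanity check, using `types.SimpleNamespace` as a stand-in for DSPy's example and prediction objects (the metric is repeated here so the snippet is self-contained):

```python
from types import SimpleNamespace

def aime_metric(gold, pred, trace=None):
    # gold is the reference answer; pred is the model's predicted output
    return gold.answer == pred.answer

# Stand-ins mimicking a gold example and two candidate predictions
gold = SimpleNamespace(answer="204")
good = SimpleNamespace(answer="204")
bad = SimpleNamespace(answer="17")

print(aime_metric(gold, good))  # True
print(aime_metric(gold, bad))   # False
```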
Step 4: Run the GEPA Optimizer
Now you can configure and run the dspy.GEPA optimizer.
from dspy.teleprompt import GEPA

# Configure the optimizer:
# metric is the evaluation function
# reflection_lm is the model used for reflection
# max_metric_calls sets the optimization budget
# (here, at most 150 metric calls)
optimizer = GEPA(metric=aime_metric,
                 reflection_lm=reflection_lm,
                 max_metric_calls=150)

# Run the optimization: CoT() is the DSPy program being optimized,
# trainset is the training data, valset is the validation data
optimized_program = optimizer.compile(CoT(), trainset=trainset, valset=valset)
Once execution completes, the prompts inside optimized_program have been optimized by GEPA. You will notice that the optimized prompts contain very specific, detailed solution strategies and caveats that GEPA learned automatically by reflecting on past mistakes.
Using GEPA standalone (advanced usage)
If you are not using the DSPy framework, you can use GEPA standalone. In this case, you need to implement your own GEPAAdapter as a bridge between GEPA and your system.
GEPAAdapter requires two key methods:
Evaluate(self, candidate, trainset_sample):
- Receives a candidate text component generated by GEPA (candidate) and a sample of the training data (trainset_sample).
- You run your system with this candidate component and return the system's scores along with detailed execution traces. Traces can be any textual information useful for reflection.
ExtractTracesforReflection(self, traces, component_name):
- Receives the traces returned by Evaluate and extracts the parts relevant to a specific component (component_name).
- The extracted text is given to the reflection model for analysis.
This is a conceptual example structure:
from gepa.core import GEPAAdapter

class MyCustomAdapter(GEPAAdapter):
    def Evaluate(self, candidate, trainset_sample):
        # Your system logic: process trainset_sample using the
        # prompt in candidate
        # ...
        scores = [...]  # compute the scores
        traces = [...]  # collect detailed logs or intermediate steps
        return scores, traces

    def ExtractTracesforReflection(self, traces, component_name):
        # Extract the textual information in traces that is
        # related to component_name
        # ...
        return relevant_textual_traces
# Then call gepa.optimize
gepa_result = gepa.optimize(
    seed_candidate={"my_prompt": "Initial prompt here..."},
    adapter=MyCustomAdapter(),
    trainset=my_train_data,
    valset=my_val_data,
    # ... other parameters
)
This approach is more complex, but it provides great flexibility and allows GEPA to be used to optimize any text-based system.
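To make the adapter contract concrete, here is a self-contained toy that exercises the two methods directly. The "system" is just a trivial keyword scorer, and all names and behavior are illustrative; this mirrors the contract described above, not the gepa library's actual internals.

```python
class ToyAdapter:
    """Illustrative adapter following the two-method contract."""

    def Evaluate(self, candidate, trainset_sample):
        scores, traces = [], []
        prompt = candidate["my_prompt"]
        for example in trainset_sample:
            # Toy "system": score 1 if the prompt mentions the example's topic
            score = 1 if example["topic"] in prompt else 0
            scores.append(score)
            traces.append({"component": "my_prompt",
                           "log": f"topic={example['topic']} score={score}"})
        return scores, traces

    def ExtractTracesforReflection(self, traces, component_name):
        # Keep only the textual logs belonging to the named component
        return [t["log"] for t in traces if t["component"] == component_name]

adapter = ToyAdapter()
data = [{"topic": "algebra"}, {"topic": "geometry"}]
scores, traces = adapter.Evaluate({"my_prompt": "Focus on algebra."}, data)
relevant = adapter.ExtractTracesforReflection(traces, "my_prompt")
print(scores)    # [1, 0]
print(relevant)  # ['topic=algebra score=1', 'topic=geometry score=0']
```

The reflection model would then read `relevant` (e.g., noticing the geometry failure) and propose an improved prompt, which the optimizer evaluates in the next iteration.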
Application Scenarios
- Prompt optimization for complex reasoning tasks
For complex tasks requiring multi-step reasoning (e.g., math, logic, and strategic planning), a small change in the prompt can lead to a large difference in results. GEPA analyzes the model's reasoning chain, automatically identifies and corrects logical flaws, and generates highly optimized instructions that guide the model toward more effective solving strategies.
- Code generation and optimization
GEPA not only generates code but can also optimize it based on textual feedback such as compilation errors, performance profiles, or code review comments. For example, it can take a generic code snippet and iteratively rewrite it into a highly optimized version based on the documentation and error messages of specific hardware (e.g., a GPU).
- Retrieval-augmented generation (RAG) system tuning
A RAG system consists of multiple stages (query rewriting, document retrieval, answer synthesis, etc.), each driven by prompts. GEPA can optimize all of these prompts at once, improving retrieval accuracy and answer quality by analyzing the execution trajectory of the entire RAG pipeline.
- Fine-tuning agent behavioral instructions
For agents that interact with external tools or environments, GEPA can optimize their core instructions (i.e., system prompts) by analyzing behavior logs (including API calls, tool results, and environmental feedback), allowing them to complete tasks more efficiently and reliably.
- Instruction learning for domain-specific knowledge
In specialized domains (e.g., medicine, law, finance), AI systems must strictly follow specific guidelines and specifications. GEPA can use these guideline documents as a basis for reflection: when the system's output violates the specifications, GEPA automatically incorporates the relevant rules into the prompt to make the output more compliant.
Q&A
- How does GEPA differ from traditional reinforcement learning (RL) optimization methods?
The main difference is the richness of the learning signal. Traditional RL methods typically rely on a single, sparse reward (e.g., 1 for success, 0 for failure), so the model needs a large number of attempts to learn an effective policy. GEPA instead exploits rich natural-language feedback: an LLM "reads" detailed execution logs to understand exactly why a run failed, enabling more precise improvements with far fewer samples.
- Does using GEPA require a very powerful language model?
GEPA's design involves two models: the "task model" being optimized and the "reflection model" doing the analysis. It is usually recommended to use the most capable model available as the reflection model (e.g., GPT-4 or better), since it must deeply understand complex execution trajectories and contexts. The task model can be whatever you need to improve, including smaller, more economical models.
- What does the "Pareto" in GEPA mean?
"Pareto" comes from the concept of Pareto optimality in multi-objective optimization. In GEPA, it means the optimization does not simply chase the highest score on a single metric (e.g., accuracy); it can also consider other objectives such as prompt length, API cost, or response latency. GEPA maintains a "Pareto frontier": a set of candidates that strike a good balance between the different objectives, rather than a single "best" option.
- Can GEPA only optimize English prompts?
No. GEPA's underlying mechanism relies on a language model's ability to understand and generate text, so it naturally supports multiple languages. As long as your training data, evaluation metric, and reflection model support the target language (e.g., Chinese), GEPA can optimize text components in that language.
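The Pareto-frontier idea can be illustrated with a small helper: given candidates scored on accuracy (higher is better) and prompt length (lower is better), keep those not dominated by any other candidate. This sketch is for intuition only; GEPA's actual selection logic may differ.

```python
def pareto_frontier(candidates):
    """candidates: list of (name, accuracy, prompt_length) tuples.
    A candidate is dominated if another has accuracy >= its accuracy AND
    length <= its length, with at least one inequality strict."""
    frontier = []
    for name, acc, length in candidates:
        dominated = any(
            (a2 >= acc and l2 <= length) and (a2 > acc or l2 < length)
            for _, a2, l2 in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier

cands = [("short_ok",   0.80, 120),   # weakest accuracy, shortest prompt
         ("long_best",  0.92, 600),   # best accuracy, long prompt
         ("long_worse", 0.85, 700)]   # worse AND longer than long_best
print(pareto_frontier(cands))  # ['short_ok', 'long_best']
```

Both `short_ok` and `long_best` survive because neither beats the other on every objective, while `long_worse` is discarded: `long_best` is both more accurate and shorter.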