Original: [State-of-the-art Code Generation with AlphaCodium - From Prompt Engineering to Flow Engineering]
By Tal Ridnik
AlphaCodium: Leading the Way to a New Realm of Code Generation, from Prompt Engineering to Flow Engineering
TL;DR
The challenges of code generation differ from those of ordinary natural language processing: they involve strictly following the syntax of the target programming language, recognizing normal and edge cases, attending to the many details in the problem specification, and addressing other code-specific issues and requirements. As a result, many of the optimization techniques commonly used for natural language generation may not carry over to code generation tasks.
In this work, we propose AlphaCodium, a novel code generation method: a test-based, multi-stage, code-oriented iterative flow. This approach significantly improves the ability of large language models (LLMs) to solve code problems.
We tested AlphaCodium on CodeContests, a challenging code generation dataset that contains competitive programming problems from platforms such as Codeforces. Our approach consistently achieves significant performance gains on this benchmark.
For example, on the validation set, GPT-4 accuracy (pass@5) improved from 19% with a single well-designed direct prompt to 44% with the AlphaCodium flow. AlphaCodium not only outperforms previous work such as AlphaCode, but also requires a significantly smaller computational budget.
We believe that many of the principles and best practices developed in this work apply broadly to code generation tasks. Our AlphaCodium solution for CodeContests is shared in our open-source project [AlphaCodium], with full dataset evaluation and benchmarking scripts, for further research and exploration by the community.
The CodeContests dataset
[CodeContests] is a challenging code generation dataset from Google DeepMind. It draws on competitive programming platforms such as [Codeforces] and contains a selection of about 10,000 programming problems designed to train and evaluate the ability of large language models (such as GPT or DeepSeek) to solve complex programming problems.
This work does not focus on training an entirely new model, but on creating a code-oriented flow that is applicable to a variety of LLMs that can already handle coding tasks. We therefore focus on the validation and test sets of CodeContests, which consist of 107 and 165 problems, respectively. Figure 1 shows an example of a typical problem from the dataset:
Figure 1. A standard problem in CodeContests.
Each problem includes a description and some public tests that can be used directly as model input. The challenge is to write a program that produces the correct answer for any valid input. In addition, a private test set, which is not released with the problem, is used to evaluate the correctness of the submitted program.
Why is CodeContests a good dataset for evaluating the code generation capability of LLMs? First, unlike many other competitive programming datasets, CodeContests includes a large number of private tests (about 200 per problem), which ensures an accurate evaluation. Second, LLMs are generally not good at noticing small details in a problem description, and those details are often critical to the correct solution. CodeContests descriptions are typically long and detailed, full of nuances that affect the solution (a typical example is shown in Figure 1). This design mimics the complexity of real-world problems, forcing the model to consider many factors, in sharp contrast to simpler and more straightforward datasets such as [HumanEval]. A typical HumanEval programming problem is shown in Appendix 1.
Figure 2 shows how the model analyzes the problem from Figure 1 in depth. Through this in-depth analysis the problem becomes clearer and better organized, underscoring how important a deep understanding of the problem is within the flow.
Figure 2. Self-reflection generated by the AI for the problem described in Figure 1.
Proposed methodology
When tackling code generation problems of this complexity, we found that neither single-prompt optimizations nor chain-of-thought prompts meaningfully improved the solve rate of LLMs on CodeContests. The model often struggles to fully understand the problem, and so it repeatedly produces code that is wrong or fails on unseen tests. Approaches that work for general natural language tasks may not be ideal for code generation, which offers untapped opportunities such as repeatedly executing the generated code and checking it against known examples. In contrast to prompt-optimization techniques from regular natural language processing, we found that a workflow built specifically around code generation and testing is more effective for solving CodeContests problems. The flow is centered on an iterative process in which we repeatedly run the generated code and adjust it until it passes the input-output tests. Two key elements of this code-oriented flow are:
(a) generating additional useful data in a pre-processing phase, such as self-reflection and reasoning about the public tests, to support the iterative process, and (b) augmenting the public tests with additional AI-generated tests. In Figure 3, we show the flow we designed for solving competitive programming problems:
Figure 3. The proposed AlphaCodium process.
Flow phases: pre-processing and code iterations
The process in Figure 3 is divided into two main phases:
- Pre-processing phase: a linear flow in which we reason about the problem in natural language.
- Code iterations phase: iterative stages in which we generate code, run it, and fix it against various tests.
In Table 1, we review these different stages in detail:
Stage | Description |
Problem Reflection | Summarize the problem in the form of concise bullet points that address the problem's objectives, inputs, outputs, rules, constraints, and other important details. |
Reason about public tests | Explain how the input of each public test leads to its particular output. |
Conceptualizing possible solutions | Suggest 2-3 possible solutions and describe them in layman's terms. |
Solution Evaluation | Evaluate the various possible solutions and select the best one, taking into account its correctness, simplicity and robustness. (It is not necessary to limit oneself to the most efficient solution). |
Generate additional AI tests | Supplement the problem with 6-8 diverse input-output tests that try to cover situations and aspects not addressed by the original public tests. |
Initial code solution | The goal of this stage is to produce an initial code solution to the problem. It is important that this code be reasonably close to the correct answer, so that the fixing iterations that follow have a better chance of succeeding. The stage works as follows: - Choose a potential solution, write code for it, and run it on selected public tests and AI tests. - Repeat this process until the tests pass or the maximum number of attempts is reached. - The first code that passes the tests, or the code whose output is closest to the expected one, is used as the base code for the next steps. |
Iterate on public tests | Starting from the base code, run it on the public tests one by one and improve it. If the code fails one of the tests, try to fix it given the error message. |
Iterate on AI-generated tests | Continue the run-fix iterations on the AI-generated tests, using "test anchors" (tests that already passed are kept as anchors so the code can be debugged and improved more reliably). |
Table 1. Characterization of the AlphaCodium phases.
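To make the stages in Table 1 more concrete, here is a minimal Python sketch of how such a flow could be orchestrated. It is an illustration only, not the project's actual implementation: the helpers call_llm, run_code, and parse_tests are hypothetical placeholders standing in for an LLM API call, sandboxed execution of candidate code against input-output tests, and parsing of the model's YAML test output.

from typing import List, Tuple

Test = Tuple[str, str]  # (input, expected output)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-completion call (e.g. GPT-4 or DeepSeek)."""
    raise NotImplementedError

def run_code(code: str, tests: List[Test]) -> List[Test]:
    """Hypothetical placeholder: run `code` on each test input and return the failing tests."""
    raise NotImplementedError

def parse_tests(yaml_text: str) -> List[Test]:
    """Hypothetical placeholder: parse YAML-formatted tests into (input, output) pairs."""
    raise NotImplementedError

def alphacodium_like_flow(problem: str, public_tests: List[Test],
                          max_fix_iterations: int = 5) -> str:
    # --- Pre-processing: linear, natural-language reasoning ---
    reflection = call_llm(f"Summarize the problem as concise bullet points:\n{problem}")
    tests_reasoning = call_llm(f"Explain how each public test input leads to its output:\n{public_tests}")
    candidates = call_llm(f"Propose 2-3 possible solutions for:\n{reflection}")
    chosen = call_llm(f"Pick the best solution (correctness, simplicity, robustness):\n{candidates}")
    ai_tests = parse_tests(call_llm(
        f"Generate 6-8 extra input-output tests covering edge cases and large inputs:\n{reflection}"))

    # --- Code iterations: generate, run, fix ---
    code = call_llm(f"Write an initial solution for:\n{chosen}\n{tests_reasoning}")
    for test_suite in (public_tests, ai_tests):
        for _ in range(max_fix_iterations):
            failures = run_code(code, test_suite)
            if not failures:
                break
            code = call_llm(f"Fix the code so it passes these failing tests:\n{failures}\n\nCode:\n{code}")
    return code

In the real flow, the AI-generated tests are additionally double-validated and handled with test anchors, as described later in this post.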
While exploring the proposed flow, we gained some useful intuitions and insights.
First, knowledge accumulates gradually along the flow: we start with easy tasks and work up to harder ones. For example, the first step, self-reflection, produces knowledge that is reused in later, more difficult steps such as generating possible solutions. The outputs of the pre-processing phase feed the most challenging and critical part of the flow: the code iterations, where we actually try to write code that solves the problem correctly.
Second, generating additional AI tests is easier than generating a full solution. Producing useful input-output test pairs relies mainly on understanding the problem, plus basic brute-force computation or logical reasoning, and does not require fully solving the problem. Writing a complete and correct solution, by contrast, requires a full algorithmic scheme that handles any input-output pair correctly. We can therefore create many AI tests and use them to strengthen the code creation phase, as shown in Figure 4. We further enhance these additional tests by asking the model to focus on aspects not covered by the original public tests, such as large inputs and edge cases.
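As an illustration of this point, a schema for the AI-test-generation stage could look like the sketch below, written in the same Pydantic style as the PossibleSolutions example shown later in this post. The class and field names here are assumptions for illustration, not AlphaCodium's exact schema.

from typing import List
from pydantic import BaseModel, Field

class AITest(BaseModel):
    # Hypothetical field names; the point is that producing an input-output pair
    # only requires understanding the problem, not solving it in full.
    input: str = Field(description="The test input, exactly as the program would read it")
    expected_output: str = Field(description="The expected output for this input")
    what_it_covers: str = Field(description="Which aspect not covered by the public tests this test targets, e.g. large inputs or edge cases")

class AdditionalTests(BaseModel):
    tests: List[AITest] = Field(description="6-8 diverse input-output tests for the problem")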
Finally, multiple steps can be combined into a single LLM call. The flow in Figure 3 is a conceptual illustration that highlights its main steps. In practice, by using structured output (see next section), several stages can be merged into one LLM call, either to conserve resources or because the model performs better when certain tasks are handled together.
Figure 4. Improvements obtained when applying the AlphaCodium flow.
Models often struggle when solving code problems from a direct prompt alone. Iterating on the public tests stabilizes and improves the solution, but leaves "blind spots" because the public tests are not comprehensive. The full AlphaCodium flow, with its pre-processing phase and iterations on both public and AI-generated tests, further improves solution quality and significantly increases the solve rate.
Code-oriented design concepts
In this section, we present some design concepts, techniques, and best practices that we have found useful when solving code generation problems. The AlphaCodium process we present in Figure 3 makes extensive use of these design concepts:
YAML structured output: A key part of our proposed flow is the use of structured output: requiring the model to generate output in YAML format that conforms to a given Pydantic class. An example:
…
Your goal is to come up with possible solutions.
Ensure that each solution fully addresses the problem's goals, rules and constraints.
The output shall be a YAML object corresponding to the $PossibleSolutions type, according to the following Pydantic definition:
class Solution(BaseModel):
    name: str = Field(description="The name of the solution")
    content: str = Field(description="A description of the solution")
    why_it_works: str = Field(description="Why this solution works. Be specific and detailed with respect to the problem's rules and goals")
    complexity: str = Field(description="The complexity of the solution")

class PossibleSolutions(BaseModel):
    possible_solutions: List[Solution] = Field(max_items=3, description="A list of possible solutions to the problem. Make sure each solution fully addresses the problem's rules and goals, and has a reasonable runtime on a modern computer - no more than three seconds under the problem's constraints for large inputs")
Table 2. Example of a structured output prompt (generate possible solutions stage).
Structured output reduces the complexity of prompt engineering and the need for hacky tricks; instead, complex tasks are expressed in a straightforward, code-like manner. It also makes it possible to obtain complex, multi-stage answers that reflect a logical and organized thought process.
Although newer GPT versions support [JSON-style] output, we believe that YAML output is more suitable, especially for code generation tasks, as detailed in the appendix.
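As a rough illustration of how such structured output can be consumed (not the project's actual parsing code), the model's YAML answer can be loaded straight back into the Pydantic class. The example answer below is invented for demonstration.

from typing import List
import yaml
from pydantic import BaseModel

class Solution(BaseModel):
    name: str
    content: str
    why_it_works: str
    complexity: str

class PossibleSolutions(BaseModel):
    possible_solutions: List[Solution]

# A made-up YAML answer in the format the prompt above requests.
llm_answer = """\
possible_solutions:
- name: Greedy single pass
  content: |-
    Scan the array once, keeping the best value seen so far.
  why_it_works: |-
    Each rule of the problem only constrains adjacent elements, so a local check suffices.
  complexity: O(n) time, O(1) memory
"""

parsed = PossibleSolutions(**yaml.safe_load(llm_answer))
print(parsed.possible_solutions[0].name)  # -> Greedy single pass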
Bullet point analysis – When asking an LLM to analyze a problem, requesting the output as a bullet-point list usually produces better results. Bullet points encourage a deeper understanding of the problem and force the model to divide its output into logical semantic sections, which improves the quality of the result. For example, when the model self-reflects on a problem in bullet points (see Figure 2), each bullet represents a semantic understanding of a different part of the problem: general description, goals and rules, input structure, and output structure.
LLMs are better at generating modular code – When we ask an LLM to write a single long function, we often run into trouble: the code frequently contains bugs or logical gaps. Worse, a large monolithic block of code hampers the later iterative fixing; even with the error message available, the model struggles to locate and fix the problem. If instead we explicitly instruct the model to "split the generated code into small sub-functions with meaningful names", the results are much better: the generated code has fewer bugs, and the iterative fixing stages succeed more often.
The importance of soft decisions and double validation – LLMs tend to struggle with code tasks that require careful, well-reasoned, non-trivial decisions. For example, when generating additional tests for a problem, the generated tests often contain errors. To address this, we introduce a double-validation step: after producing an initial output, the model is asked to regenerate the same output and correct it if necessary. For instance, given its own AI-generated tests as input, the model must re-generate those tests and fix any mistakes in them, if there are any. We found that this double-validation step not only pushes the model toward critical thinking and reasoning, it is also more effective than asking a direct yes/no question such as "is this test correct?".
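A minimal sketch of such a double-validation step might look like the following, assuming the hypothetical call_llm helper from the earlier sketch (the prompt wording is illustrative). The key design choice is to ask the model to regenerate and correct its own output rather than to answer a yes/no question about it.

def double_validate_ai_tests(problem: str, generated_tests_yaml: str, call_llm) -> str:
    # Feed the model its own previously generated tests and ask it to
    # regenerate them, fixing any incorrect expected outputs along the way.
    prompt = (
        "You previously generated the following additional input-output tests "
        "for the problem below.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Generated tests:\n{generated_tests_yaml}\n\n"
        "Regenerate the same tests, correcting any test whose expected output "
        "is wrong. Answer in the same YAML format."
    )
    return call_llm(prompt)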
Postpone decisions, avoid direct questions, leave room for exploration – When we ask the model a complex question directly, we often get wrong or unrealistic answers. Instead, we adopt an approach similar to the one Karpathy describes in the tweet below: accumulate data gradually, moving from easier tasks to harder ones:
- Start with the easiest tasks: self-reflection on the problem and reasoning about the public tests.
- Then move to generating additional AI tests and possible solutions to the problem.
- Only after obtaining the model's answers to these tasks do we move on to the actual iterative process of generating code and running and fixing it.
Karpathy: This fits well with the idea that an LLM needs tokens "to think with". In some cases, the chain of thought may simply serve as an additional store of information, rather than playing other, more important roles.
As another example, rather than committing to a single algorithmic solution, we evaluate and rank several possible solutions and use the top-ranked one to write the initial code. Since the model can be wrong, we prefer to avoid irreversible decisions and instead leave room for exploration, with code iterations that can try different possible solutions.
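As a small illustration (with hypothetical helper names, not AlphaCodium's actual code) of keeping the exploration space open, one can rank the candidate solutions and carry the runners-up forward as fallbacks instead of discarding them:

from typing import List

def rank_candidate_solutions(problem: str, candidates: List[str], call_llm) -> List[str]:
    # Ask the model to order the candidates rather than to pick "the" answer;
    # the top-ranked one seeds the initial code, and the rest remain available
    # so later code iterations can fall back to a different approach.
    ranked_yaml = call_llm(
        "Rank these candidate solutions from most to least promising, "
        "considering correctness, simplicity and robustness (not only efficiency).\n"
        f"Problem:\n{problem}\n\nCandidates:\n" + "\n---\n".join(candidates)
    )
    return parse_ranked_solutions(ranked_yaml)

def parse_ranked_solutions(yaml_text: str) -> List[str]:
    """Hypothetical placeholder for parsing the model's YAML ranking into an ordered list."""
    raise NotImplementedError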
Test anchors – Even with double validation, some AI-generated tests may still be wrong. This raises a question: when a test fails, how do we know whether the code is wrong or the test itself is wrong? Asking the model directly "where is the error?" often yields unrealistic answers and can even lead to wrongly modified code. To deal with this challenge, we introduce a technique called "test anchors":
- Start by iterating on the public tests, which are known to be correct. Once this step is complete, designate all the tests that passed as anchor tests.
- Then start checking the AI-generated tests one by one.
- Those that pass the test are added to the anchor test list.
- If a test fails, assume the code is wrong and try to fix it. Crucially, the fixed code must still pass all existing anchor tests.
In this way, the anchor tests protect us from fixing the code incorrectly while we repair it. A further improvement is to order the AI-generated tests from easy to hard: easier tests become anchors earlier in the iteration, which provides extra protection later on when handling the harder AI tests, the ones most likely to have an incorrect expected output. This strategy noticeably improves the stability and reliability of the testing process.
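Below is a minimal sketch of the test-anchor loop, under the same assumptions as the earlier sketches (a hypothetical run_code that returns the failing tests and a hypothetical call_llm that proposes a fix); it is illustrative rather than AlphaCodium's actual implementation.

from typing import List, Tuple

Test = Tuple[str, str]  # (input, expected output)

def iterate_with_test_anchors(code: str, public_tests: List[Test], ai_tests: List[Test],
                              run_code, call_llm, max_fix_attempts: int = 3) -> str:
    # Public tests are known to be correct; once passed, they become anchors.
    anchors: List[Test] = list(public_tests)

    for test in ai_tests:  # ideally ordered from easiest to hardest
        for _ in range(max_fix_attempts):
            if not run_code(code, [test]):
                anchors.append(test)  # the test passes: promote it to an anchor
                break
            candidate = call_llm(
                f"Fix the code so it passes this failing test:\n{test}\n\nCode:\n{code}")
            # Accept the fix only if it does not break any anchor test.
            if not run_code(candidate, anchors):
                code = candidate
        # If the test never passes, the AI-generated test itself may be wrong,
        # so we move on without forcing a code change.
    return code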
Results
Direct prompt vs. AlphaCodium
In Figure 5, we compare the results of AlphaCodium with those of a single well-designed direct prompt. The evaluation metric is pass@k: the fraction of problems solved when k generated solutions are allowed per problem.
Figure 5. Comparison of AlphaCodium with the direct prompt method on different models.
As can be seen, the AlphaCodium flow significantly and consistently improves the performance of LLMs on CodeContests problems. This holds for both open-source (DeepSeek) and closed-source (GPT) models, and on both the validation and test sets.
Comparison with other studies:
In Table 3, we show the results of AlphaCodium compared to other methods in the literature.
Model | Dataset | Method | Score |
GPT-3.5 | validation set | AlphaCodium (pass@5) | 25% |
GPT-3.5 | validation set | CodeChain (pass@5) | 17% |
GPT-3.5 | test set | AlphaCodium (pass@5) | 17% |
GPT-3.5 | test set | CodeChain (pass@5) | 14% |
GPT-4 | validation set | AlphaCodium (pass@5) | 44% |
DeepMind Fine-tuning | validation set | AlphaCode (pass@10@1K) | 17% |
DeepMind Fine-tuning | validation set | AlphaCode (pass@10@100K) | 24% |
GPT-4 | test set | AlphaCodium (pass@5) | 29% |
DeepMind Fine-tuning | test set | AlphaCode (pass@10@1K) | 16% |
DeepMind Fine-tuning | test set | AlphaCode (pass@10@100K) | 28% |
Gemini-pro | Note: AlphaCode2 results are not reported on the released CodeContests version. According to the AlphaCode2 technical report, the researchers compared AlphaCode2 with AlphaCode on an unpublished updated dataset and found that, with far fewer LLM calls (@100), AlphaCode2 achieves performance comparable to AlphaCode (29%, pass@10). |
Table 3: Comparison of AlphaCodium with other research works in the literature
These comparisons show that AlphaCodium performs well across models and evaluation settings when solving programming challenges with LLMs, highlighting both its technical novelty and its practical effectiveness.
Overall, AlphaCodium demonstrates clear potential for improving the ability of LLMs to handle complex programming problems, and these findings offer useful reference points for further development and optimization of LLM-based code generation.
Figure 6. Efficiency comparison: accuracy versus the number of LLM calls for AlphaCodium and other solutions. Compared to AlphaCode, AlphaCodium achieves similar accuracy with thousands of times fewer LLM calls.
When comparing AlphaCodium with [CodeChain] using the same GPT-3.5 model and the same pass@5 metric, AlphaCodium clearly performs better. When comparing with [AlphaCode], note that AlphaCode uses a different code generation strategy: it fine-tunes a model specifically for coding problems, generates a very large number of candidate solutions, clusters them, and submits a few candidates chosen from the largest clusters. For example, pass@10@100K means generating 100,000 solutions, clustering them, and then choosing 10 to submit. AlphaCode thus uses a specially fine-tuned model and a far larger number of LLM calls, closer to an exhaustive-search strategy. Even so, AlphaCodium achieves better top results.
It is also worth noting that neither AlphaCode nor CodeChain released a reproducible solution with a complete end-to-end evaluation script. Many details matter when evaluating results, such as handling problems with multiple solutions, tolerance mechanisms, and timeouts. Our comparison is based on the numbers reported in their papers, but for the reliability and reproducibility of future comparisons we provide a complete, reproducible set of code and evaluation scripts.
Computational effort: AlphaCodium vs. AlphaCode and AlphaCode2
In the AlphaCodium flow, solving a problem requires about 15-20 LLM calls, so a run with five attempts involves roughly 100 LLM calls per problem.
AlphaCode does not explicitly report how many LLM calls it makes per problem. If we assume one call per attempt (which is unknown and may in fact be more), then the 10 attempts filtered from 100,000 solutions would require about one million LLM calls per problem, four orders of magnitude more than AlphaCodium. Yet, as the results show, AlphaCodium performs considerably better, as Figure 6 makes clear.
A recently published work, AlphaCode2 ([technical report]), evaluated a Gemini-Pro model fine-tuned for programming problems. It also benchmarked on CodeContests, but using an unpublished updated version of the dataset. According to the AlphaCode2 report, with only about 100 samples, AlphaCode2 reaches the level of performance that AlphaCode achieved with a million samples, making it over 10,000 times more sample-efficient. As a result, both AlphaCode2 and AlphaCodium use far fewer LLM calls than AlphaCode.
However, AlphaCode2 relies on a modern foundation model that was elaborately fine-tuned specifically for CodeContests-style competitions, whereas AlphaCodium uses an unmodified general-purpose model. Even so, AlphaCodium improves the model's results without extra data or an expensive training phase.
Appendix
1) A typical HumanEval code problem:
/*
Check if, in a given set of numbers, there are any two numbers whose distance from each other is less than a given threshold.
>>> has_close_elements({1.0, 2.0, 3.0}, 0.5)
false
>>> has_close_elements({1.0, 2.8, 3.0, 4.0, 5.0, 2.0}, 0.3)
true
*/
#include <stdio.h>
#include <vector>
#include <math.h>
using namespace std;
bool has_close_elements(vector<float> numbers, float threshold){
Table 4. A typical HumanEval code problem. The problem is relatively intuitive and simple, without many details or subtleties for the model to reason about.
2) Why YAML output is better suited for code generation tasks than JSON output
Although newer GPT versions have [native support] for JSON-style output, we believe that YAML output is better suited to code generation. Generated code often contains single quotes, double quotes, and special characters; it is hard for an LLM to place these correctly in JSON, where strings must be enclosed in double quotes and escaped. YAML output, by contrast, with [block scalar] style only needs to follow indentation rules, so any properly indented text or code is valid. In addition, YAML output uses fewer tokens, which means lower cost and faster inference, as well as better quality, since the model has fewer non-essential tokens to attend to. Here is an example comparing JSON and YAML output (token counts generated with [https://platform.openai.com/tokenizer]):
import json
import yaml

s1 = 'print("double quote string")'
s2 = "print('single quote string')"
s3 = 'print("""triple quote string""")'
s4 = f"{s1}\n{s2}\n{s3}"

# Create a dictionary with keys as variable names and values as the strings
data = {'s1': s1, 's2': s2, 's3': s3, 's4': s4}

# Convert the dictionary to a JSON-formatted string
json_data = json.dumps(data, indent=2)
print(json_data)

# Convert the dictionary to a YAML-formatted string in block scalar style
yaml_data = yaml.dump(data, indent=2, default_style='|')
print(yaml_data)
Output.
Table 5.
JSON output:
Figure 7. Example of Token counting using JSON output.
A sample YAML output is shown below:
Figure 8. Example of Token counting using YAML output.
Clearly, generated code that only needs to maintain proper indentation is more concise and readable, and it is also less error-prone.