
[Image 1: SWE-1, the software engineering model released by Windsurf]

SWE-1: A New Generation of Frontier Models for Software Engineering

The much-anticipated SWE-1 family of models has been officially released. Designed to optimize the entire software engineering process, the family aims well beyond the traditional task of writing code.

The SWE-1 family currently contains three models, each with a distinct role:

  • SWE-1: The flagship model, claimed to be comparable to Claude 3.5 Sonnet in tool-call reasoning while being cheaper to serve. During the promotional period, all paid users can use it for free (0 credits per user prompt).
  • SWE-1-lite: A lightweight model designed to replace the original Cascade Base model with higher quality. Unlimited use for all users, free or paid.
  • SWE-1-mini: A small, extremely fast model that powers the passive Windsurf Tab experience for all users.

The motivation for developing the SWE-1 series is clear: to accelerate software development by 99%. Models with only "coding skills" can no longer meet the complex needs of modern software engineering, because writing code is only one part of the software development lifecycle.

A quick look at the background

The capabilities of coding models have advanced significantly in recent years. Industry expectations of these models have evolved from simple auto-completion suggestions to the ability to reliably build simple applications in a single pass.

However, existing coding models show their limitations in the following ways.

First, any software developer will agree that their time is not spent only on writing code. Software engineering involves more kinds of tasks and more work surfaces, so expectations for model capabilities should rise accordingly. The ideal model should be able not only to read and write code, but also to work in a terminal, access external knowledge bases and the internet, test and use the product, and understand user feedback. In short, a developer's job is much more than writing code.

Second, software development typically means working across multiple surfaces over long periods and through a series of incomplete states. Today's top coding foundation models are still trained mainly on tactical correctness, for example whether the final code compiles and passes unit tests. For developers, though, unit tests are only one part of a much larger engineering problem. There may be many ways to implement a feature that works today, but far fewer ways to implement a great feature that can support years of iteration. This explains why many models in Cascade perform well under active user guidance but degrade significantly the longer they run on their own. Achieving a higher degree of workflow automation requires overcoming this limitation: the model must handle the full complexity of engineering work, reasoning over incomplete states and dealing with potentially ambiguous outcomes.

At some point, simply improving coding skills no longer provides a substantial improvement in software engineering capabilities for either the software engineer or the model. The ultimate goal is to accelerate everything a software engineer can do, so the need for a "software engineering" model (or SWE model for short) has been clear for a long time.

SWE-1 in Detail

Based on insights gained from the heavily used Windsurf Editor, the development team set out to build a new data model (the shared timeline) and a training methodology that together capture incomplete states, long-running tasks, and complex interactions across multiple work surfaces.

The initial goal was to demonstrate that this approach can reach frontier-level performance even with a small engineering team and limited compute. SWE-1 is an initial proof of concept for this idea.

Overall, SWE-1's performance approaches that of the frontier foundation models. Importantly, it outperforms all non-frontier models and open-source alternatives. To benchmark it, both offline evaluations and blind production experiments were conducted.

Offline evaluation

The team compared SWE-1's performance against the Anthropic family of models (among the most heavily used models in Cascade) as well as the leading open-source coding models, DeepSeek and Qwen.

Conversational SWE Task Benchmark: This test starts partway through an existing Cascade session with a partially completed task, and assesses how well Cascade handles the user's next query. The composite score, on a 0-10 scale, is a weighted average of reviewer ratings for helpfulness, efficiency, and correctness, together with accuracy metrics for edits to the target files.

This benchmark is designed to capture the distinctive human-in-the-loop, agentic style of coding that Cascade pioneered. As long as models are imperfect, the ability to pick up seamlessly from user input on a partially completed task is an important indicator of a model's usefulness.
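
The composite score described above can be sketched as a plain weighted average. This is only an illustration: the article does not publish the actual weights or the scale of the edit-accuracy metric, so the equal weights and the assumption that every component is already on a 0-10 scale are hypothetical, as are all names in the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversationalRatings:
    """Per-task inputs, each assumed (hypothetically) to be on a 0-10 scale."""
    helpfulness: float    # reviewer rating
    efficiency: float     # reviewer rating
    correctness: float    # reviewer rating
    edit_accuracy: float  # accuracy of edits to the target files


# Hypothetical equal weights; the real weights are not given in the article.
ASSUMED_WEIGHTS = {
    "helpfulness": 0.25,
    "efficiency": 0.25,
    "correctness": 0.25,
    "edit_accuracy": 0.25,
}


def composite_score(ratings, weights=ASSUMED_WEIGHTS):
    """Return the weighted average of the rating components."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(getattr(ratings, name) * w for name, w in weights.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    example = ConversationalRatings(
        helpfulness=8, efficiency=6, correctness=10, edit_accuracy=4
    )
    print(composite_score(example))  # 7.0 with the assumed equal weights
```

Dividing by the sum of the weights keeps the result on the 0-10 scale even if the weights are changed and no longer sum to 1.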

[Image 2: SWE-1, the software engineering model released by Windsurf]

End-to-End SWE Task Benchmark: This test starts from the beginning of a conversation and assesses how well Cascade satisfies the input intent, as judged by a selected set of unit tests. The composite score, on a 0-10 scale, is a weighted average of the test pass rate and reviewer ratings.

This benchmark is designed to capture the ability of models to independently solve end-to-end problems. This use case is becoming increasingly important as the ability of all models to operate without human intervention increases.

[Image 3: SWE-1, the software engineering model released by Windsurf]

Based on the offline evaluation, SWE-1 appears to be within the range of the frontier models from the foundation model labs on these tasks, and ahead of the mid-sized models and the leading open-source alternatives. While not yet at the absolute top, it shows the potential to compete with the leading models.

Production experiments

Drawing on a large user community, production experiments were run to complement the offline evaluation. The daily metrics below come from a blind experiment in which a subset of users took part without knowing which model they were using. Each user stayed on the same model throughout, so that repeated use over time could be measured.

The experiment includes the Claude models as baselines, since they have historically been, and continue to be, the most commonly used models in Cascade.
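
One common way to keep the test model constant for each user in a blind experiment is deterministic, hash-based assignment. The sketch below illustrates that general technique only; the article does not say how Windsurf assigned users, and the function name, salt, and arm names are all hypothetical.

```python
import hashlib


def assign_model(user_id: str, models: list[str], salt: str = "blind-test-v1") -> str:
    """Map a user to one model arm, always the same one for the same inputs.

    Hashing the salted user ID avoids storing an assignment table: the same
    user, salt, and arm list always produce the same arm.
    """
    if not models:
        raise ValueError("at least one model arm is required")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(models)
    return models[bucket]


if __name__ == "__main__":
    arms = ["model-a", "model-b", "model-c"]  # hypothetical arm names
    first = assign_model("user-123", arms)
    second = assign_model("user-123", arms)
    print(first == second)  # True: the assignment is stable across calls
```

Changing the salt reshuffles every user, which is useful when a new experiment should not inherit the previous grouping.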

Daily Lines Contributed per User: Measures the average number of lines of code that Cascade writes and that users actively accept and retain over a fixed period. It is a holistic measure of helpfulness, reflecting both how useful the model's contribution is each time it is invoked and how willing users are to keep using the model over time.

This is considered a highly indicative metric: it balances proactiveness and quality of suggestions against speed and responsiveness to feedback, all of which combine to drive repeat usage.
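
As a rough illustration of the metric above, the sketch below sums the retained lines per user, divides by the number of days in the period, and then averages across users. The record layout and the exact aggregation are assumptions; the article describes the metric only in prose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptedEdit:
    """One Cascade edit that a user accepted (hypothetical record layout)."""
    user_id: str
    lines_retained: int  # lines written by Cascade that the user kept


def daily_lines_per_user(edits, period_days):
    """Return {user_id: retained lines per day} over the period."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    totals = {}
    for edit in edits:
        totals[edit.user_id] = totals.get(edit.user_id, 0) + edit.lines_retained
    return {user: total / period_days for user, total in totals.items()}


def mean_daily_lines(edits, period_days):
    """Average the per-user daily figure across all users with any edits."""
    per_user = daily_lines_per_user(edits, period_days)
    return sum(per_user.values()) / len(per_user) if per_user else 0.0


if __name__ == "__main__":
    log = [
        AcceptedEdit("u1", 8),
        AcceptedEdit("u1", 4),
        AcceptedEdit("u2", 6),
    ]
    print(mean_daily_lines(log, period_days=2))  # 4.5
```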

[Image 4: SWE-1, the software engineering model released by Windsurf]

Cascade Contribution Rate: For files that Cascade has edited at least once, this metric calculates the percentage of changes to those files that come from Cascade. It is a measure of helpfulness, normalized for how often users invoke the model and how willing the model is to contribute code. Because the metric only covers files the model has edited, it attempts to control for frequency of use and the model's propensity to edit.
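
The definition above can be sketched as follows. The record layout and the choice to count changes in lines are assumptions made for illustration; the article does not specify the unit of a "change".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileChange:
    """One change to a file (hypothetical record layout)."""
    path: str
    author: str         # "cascade" or "user"
    lines_changed: int  # assumed unit of change


def cascade_contribution_rate(changes):
    """Percentage of changed lines that came from Cascade, restricted to
    files that Cascade edited at least once."""
    cascade_files = {c.path for c in changes if c.author == "cascade"}
    in_scope = [c for c in changes if c.path in cascade_files]
    total = sum(c.lines_changed for c in in_scope)
    from_cascade = sum(c.lines_changed for c in in_scope if c.author == "cascade")
    return 100.0 * from_cascade / total if total else 0.0


if __name__ == "__main__":
    history = [
        FileChange("a.py", "cascade", 30),
        FileChange("a.py", "user", 10),
        FileChange("b.py", "user", 50),   # never touched by Cascade: excluded
        FileChange("c.py", "cascade", 20),
        FileChange("c.py", "user", 20),
    ]
    print(cascade_contribution_rate(history))  # 62.5
```

Excluding files Cascade never touched is what keeps the figure from being dominated by how often a user chooses to invoke the model at all.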

[Image 5: SWE-1, the software engineering model released by Windsurf]

SWE-1 was built and optimized for the way users interact with Cascade, so it is not surprising that its performance in these production experiments is close to the industry's best.

Other Models and Analyses

In the charts above, SWE-1-lite is a mid-sized version of SWE-1 built with the same training methodology. It leads all other non-frontier, mid-sized models and will replace the original Cascade Base model as the unlimited-use option for all users. Cascade Base previously served as the base model option providing general coding assistance; the upgrade to SWE-1-lite brings better quality and performance.

In addition, a third model, SWE-1-mini, was built. It shares much of the training methodology around flow awareness, but is small enough to run within the latency constraints of a passive prediction system, and is further trained for action-prediction tasks (rather than tool calls). This passive prediction system anticipates and assists the user while they code; in the Windsurf Tab experience, for example, it silently analyzes context in the background and offers suggestions at the right moment.

To be clear, this is just the beginning. Ultimately, the goal in software engineering is not just to match the frontier models of any research lab, but to surpass them. There is more reason than ever to believe that the engine to reach this goal is in place, and the team will invest heavily in this strategy going forward.

Core Technology: Flow-Aware System

Earlier, SWE-1 was described as building on insights gained from heavy use of the Windsurf Editor. It is worth explaining how the Windsurf Editor made SWE-1 possible, and why the team is confident that its models will ultimately be the best.

The key to iterating incrementally is flow awareness.

What is flow awareness? The Windsurf Editor was built to create a seamless interweaving between the states of the user and the AI: anything the AI does, the human should be able to observe and act on; likewise, anything the human does, the AI should be able to observe and act on. This awareness of a shared timeline is called "flow awareness", which is why the collaborative agentic experience is called "AI flows".

Why is an editor that supports flow awareness critical? Simply put, it will be some time before any SWE model can truly do all of the work on its own. In the meantime, flow awareness enables the right form of interaction: use the model for what it can already do, let the human step in to correct it when it goes wrong, and then let the model continue building on what the human did. The handoff is seamless and natural.

This means that at any given moment, by observing which steps in the shared timeline the model completed with and without user intervention, the Windsurf team knows the true limits of the current model. It gets a large-scale, accurate picture of where users want the model to improve next. This mechanism is what allowed the team to build models up to the level of today's SWE-1 so quickly, and it is why the team is confident that it will eventually build the absolute best SWE model.

In fact, whether users noticed it or not, building the shared timeline has always been the guiding vision behind many of Cascade's key features:

  • When Cascade launched, one of its highlighted features was that the user could make some edits in the text editor and then type "continue" to Cascade, which would automatically incorporate those edits. This reflects awareness of the text editor.
  • Soon after, terminal output was also integrated into flow awareness, letting Cascade seamlessly notice errors the user encounters while running code. This reflects awareness of the terminal.
  • In Wave 4, Previews were introduced so that Cascade can develop some understanding of the front-end components or bugs the user is interacting with and interested in. This reflects basic awareness of the browser.
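
To make the shared-timeline idea concrete, here is a minimal sketch of one way such a timeline could be represented: a single ordered list of events, each tagged with who acted and on which surface. It is a hypothetical illustration, not Windsurf's actual data model; every class, field, and method name is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Actor(Enum):
    USER = "user"
    AI = "ai"


class Surface(Enum):
    EDITOR = "editor"
    TERMINAL = "terminal"
    BROWSER = "browser"


@dataclass(frozen=True)
class TimelineEvent:
    actor: Actor
    surface: Surface
    summary: str


@dataclass
class SharedTimeline:
    """One ordered log that both the user and the AI append to."""
    events: list[TimelineEvent] = field(default_factory=list)

    def record(self, actor: Actor, surface: Surface, summary: str) -> None:
        self.events.append(TimelineEvent(actor, surface, summary))

    def user_events_since_last_ai_action(self) -> list[TimelineEvent]:
        """What the user did after the AI last acted, i.e. what the AI
        would need to take into account when asked to continue."""
        last_ai = -1
        for index, event in enumerate(self.events):
            if event.actor is Actor.AI:
                last_ai = index
        return [e for e in self.events[last_ai + 1:] if e.actor is Actor.USER]


if __name__ == "__main__":
    timeline = SharedTimeline()
    timeline.record(Actor.AI, Surface.EDITOR, "added parse_config()")
    timeline.record(Actor.USER, Surface.EDITOR, "renamed parse_config to load_config")
    timeline.record(Actor.USER, Surface.TERMINAL, "test run failed with ImportError")
    for event in timeline.user_events_since_last_ai_action():
        print(event.surface.value, "-", event.summary)
```

In this sketch, typing "continue" would correspond to reading the user events recorded since the AI's last action, whichever surface they came from.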

However, everything in Windsurf is built on the concept of flow awareness, not just Cascade. Tab is built on the same shared timeline concept. When context is added for Cascade, it is in effect also being added for Tab. This is not a matter of cramming more information at random into a fixed context window, but of carefully constructing a shared timeline that best reflects the user's actions and goals. That is why Windsurf's Tab has the following capabilities:

  • Awareness of the user's terminal commands (Wave 5)
  • Awareness of what the user copies to the clipboard (Wave 5)
  • Awareness of the current Cascade conversation (Wave 5)
  • Awareness of the user's searches in the IDE (Wave 6)

These features were not shipped at random. Windsurf has been working to build the richest possible representation of the shared timeline of software engineering work. Even with off-the-shelf models, its tools improved significantly simply because of the information in the shared timeline. Now, with its own SWE models, it can really start the flywheel: models that can digest the timeline and begin acting on more and more of it.

Future outlook

As mentioned earlier, SWE-1 was built by a small but highly focused team, drawing on Windsurf's strengths as a product and infrastructure company. It is the team's first attempt at building a truly frontier-quality model, and while proud of the results, the team knows this is only the beginning. The work already highlights the power of Windsurf's combined application, systems, and model flywheel, something that even the foundation model labs may lack, because they do not have an application surface and activity-derived insights at the scale Windsurf operates.

Expect to keep hearing about improvements to the SWE model family. Further investment will go toward delivering the best performance at the lowest cost, so that users can keep building bigger and better things with Windsurf.
