
Reinforcement Learning Optimization Framework for WebAgent Significantly Improves Model Task Generalization Ability

2025-08-22

Innovative design of the DUPO algorithm

WebAgent is optimized with its original DUPO (Dual-Phase Unified Optimization) algorithm, a framework that combines supervised learning and reinforcement learning in two stages. The first phase performs supervised fine-tuning on 500,000 annotated examples to build base capabilities; the second applies Reinforcement Learning from Human Feedback (RLHF) on 30,000 high-quality search traces for policy optimization. This dual-phase training yields a 42% improvement in generalization to unseen task types on the BrowsingBench test set.
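
The article does not reproduce DUPO's actual training code, so the following is only a minimal, self-contained sketch of the two-phase structure it describes (supervised fine-tuning followed by RLHF-style policy optimization). The Policy class, the reward stub, and the toy data are hypothetical stand-ins, not the WebAgent API.

```python
# Toy sketch of a dual-phase pipeline: Phase 1 imitates annotated targets (SFT),
# Phase 2 reinforces responses that score well on search traces (RLHF-style).
# All names here are hypothetical stand-ins for the real components.

class Policy:
    def __init__(self):
        self.memory = {}                        # stands in for model weights

    def generate(self, prompt: str) -> str:
        return self.memory.get(prompt, "")      # lookup in place of decoding

    def supervised_update(self, prompt: str, target: str) -> None:
        self.memory[prompt] = target            # Phase 1: imitate the annotation

    def policy_update(self, prompt: str, response: str, reward: float) -> None:
        if reward > 0.5:                        # Phase 2: keep highly rewarded behavior
            self.memory[prompt] = response


def train_dupo(policy, sft_data, trace_data, reward_fn):
    # Phase 1: supervised fine-tuning on annotated examples (~500k in the article).
    for prompt, target in sft_data:
        policy.supervised_update(prompt, target)

    # Phase 2: policy optimization on high-quality search traces (~30k in the article).
    for prompt, reference in trace_data:
        response = policy.generate(prompt)
        policy.policy_update(prompt, response, reward_fn(response, reference))
    return policy


if __name__ == "__main__":
    exact_match = lambda response, reference: 1.0 if response == reference else 0.0
    sft_data = [("Who wrote Hamlet?", "William Shakespeare")]
    trace_data = [("Who wrote Hamlet?", "William Shakespeare")]
    trained = train_dupo(Policy(), sft_data, trace_data, exact_match)
    print(trained.generate("Who wrote Hamlet?"))  # -> William Shakespeare
```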

Key innovations in the training process

  • Dynamic curriculum learning: task difficulty gradients are adjusted adaptively based on model performance
  • Multidimensional reward function: accuracy, efficiency, and information-credibility metrics are optimized simultaneously (see the sketch after this list)
  • Adversarial sample augmentation: robustness to interference is strengthened using the SailorFog-QA dataset
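
The article does not specify how the three reward terms are combined or weighted; purely as a hedged illustration, a multidimensional reward of the kind described in the second bullet could be expressed as a weighted sum. Every weight and scoring choice below is an assumption, not the framework's actual reward design.

```python
# Hypothetical sketch of a multidimensional reward combining accuracy,
# efficiency, and information credibility; the weights are illustrative only.

def multidimensional_reward(answer_correct: bool,
                            num_tool_calls: int,
                            source_credibility: float,
                            w_acc: float = 0.6,
                            w_eff: float = 0.2,
                            w_cred: float = 0.2) -> float:
    accuracy = 1.0 if answer_correct else 0.0
    efficiency = 1.0 / (1 + num_tool_calls)                # fewer browsing steps -> higher reward
    credibility = max(0.0, min(1.0, source_credibility))   # clamp to [0, 1]
    return w_acc * accuracy + w_eff * efficiency + w_cred * credibility


# Example: a correct answer found in 3 tool calls from fairly credible sources.
print(multidimensional_reward(True, 3, 0.9))   # 0.6 + 0.05 + 0.18 = 0.83
```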

Engineering Implementation Advantages

The framework supports distributed training and keeps training time for a 72B-parameter model under 72 hours on a 512-GPU cluster. After optimization, parameter utilization improves by 60%, allowing more complex cross-domain query tasks to be handled with the same computing resources. More than 200 tuning templates contributed by the open-source community greatly lower the barrier for developers to perform transfer learning.
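
The community's 200-plus tuning templates are not enumerated in the article; as a hypothetical illustration only, a transfer-learning template might amount to a small configuration like the one below. Every key, value, and file name is an assumption for illustration, not the actual template format.

```python
# Hypothetical transfer-learning tuning template; keys, values, and the file
# name are illustrative assumptions, not the community's actual format.

import json

finance_qa_template = {
    "base_model": "webagent-72b",          # assumed model identifier
    "phase1": {                            # supervised fine-tuning stage
        "dataset": "finance_qa_annotated",
        "learning_rate": 1e-5,
        "epochs": 2,
    },
    "phase2": {                            # RLHF-style policy optimization stage
        "traces": "finance_search_traces",
        "reward_weights": {"accuracy": 0.6, "efficiency": 0.2, "credibility": 0.2},
        "learning_rate": 1e-6,
    },
    "distributed": {"gpus": 512, "strategy": "data_parallel"},
}

with open("finance_qa_template.json", "w") as f:
    json.dump(finance_qa_template, f, indent=2)
```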
