Innovative design of the DUPO algorithm
WebAgent is optimized with its purpose-built DUPO (Dual-Phase Unified Optimization) algorithm, a framework that integrates supervised learning and reinforcement learning in two phases. The first phase is supervised fine-tuning on 500,000 annotated examples to build base capabilities; the second applies Reinforcement Learning from Human Feedback (RLHF) over 30,000 high-quality search traces for policy optimization. This dual-phase training yields a 42% improvement in generalization to unseen task types on the BrowsingBench test set.
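The phase split can be pictured with a toy PyTorch sketch: a supervised phase on labeled pairs, then a REINFORCE-style policy-optimization phase driven by a reward signal. The model, data, and reward below are illustrative stand-ins, not WebAgent's actual training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy policy: maps a 16-dim "state" to a distribution over 4 "actions".
policy = nn.Linear(16, 4)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# --- Phase 1: supervised fine-tuning on annotated examples ---
states = torch.randn(256, 16)          # stand-in for annotated data
labels = torch.randint(0, 4, (256,))
for _ in range(100):
    loss = F.cross_entropy(policy(states), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- Phase 2: RLHF-style policy optimization on traces ---
def reward_fn(state, action):
    # Stand-in for a learned human-feedback reward model.
    return (action == state.argmax() % 4).float()

for _ in range(100):
    state = torch.randn(16)
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    loss = -dist.log_prob(action) * reward_fn(state, action)  # REINFORCE
    opt.zero_grad()
    loss.backward()
    opt.step()
```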
Key innovations in the training process
- Dynamic curriculum learning: adaptively adjusts the task-difficulty gradient based on model performance
- Multidimensional reward function: jointly optimizes accuracy, efficiency, and information-credibility metrics (see the sketch after this list)
- Adversarial sample augmentation: improves robustness to interference using the SailorFog-QA dataset
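To make the second bullet concrete, here is a minimal Python sketch of one plausible way to combine the three signals into a scalar reward. The `TraceMetrics` fields, the weights, and the efficiency formula are illustrative assumptions, not WebAgent's published reward design:

```python
from dataclasses import dataclass

@dataclass
class TraceMetrics:
    accuracy: float     # fraction of sub-answers verified correct, in [0, 1]
    steps_used: int     # browsing/tool-call steps the agent consumed
    step_budget: int    # maximum steps allotted to the task
    credibility: float  # source-quality score of cited pages, in [0, 1]

def reward(m: TraceMetrics,
           w_acc: float = 0.6, w_eff: float = 0.2, w_cred: float = 0.2) -> float:
    # Efficiency rewards finishing well under the step budget.
    efficiency = max(0.0, 1.0 - m.steps_used / m.step_budget)
    return w_acc * m.accuracy + w_eff * efficiency + w_cred * m.credibility

# An accurate, well-sourced trace that used most of its budget:
print(reward(TraceMetrics(accuracy=0.9, steps_used=18,
                          step_budget=20, credibility=0.8)))  # ~0.72
```

Weighting accuracy most heavily reflects the ordering in the bullet above; in practice the weights would be tuned alongside the policy.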
Engineering implementation advantages
The framework supports distributed training and can keep the training time of a 72B-parameter model within 72 hours on a 512-GPU cluster. After optimization, parameter utilization improves by 60%, so more complex cross-domain query tasks can be handled with the same computing resources. More than 200 fine-tuning parameter templates contributed by the open-source community dramatically lower the barrier for developers doing transfer learning.
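For the distributed-training claim, a minimal PyTorch DistributedDataParallel sketch shows the generic data-parallel pattern such a cluster run relies on; this is a general-purpose example launched with `torchrun`, not WebAgent's actual stack:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launched as: torchrun --nproc_per_node=<gpus> train.py
    dist.init_process_group("nccl")           # one process per GPU
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = DDP(nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=rank)
        loss = model(x).pow(2).mean()         # dummy objective
        opt.zero_grad()
        loss.backward()                       # gradients all-reduce here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```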
This answer comes from the article "WebAgent: An Intelligent Web Information Search and Processing Tool".