Performance Breakthroughs from PPO Optimization
The Seed-X team fine-tuned the base instruction model with the Proximal Policy Optimization (PPO) reinforcement-learning algorithm, producing Seed-X-PPO-7B, a version that significantly outperforms Seed-X-Instruct-7B on a number of metrics. Test data show that on the WMT2023 test set, the PPO version improves the BLEU score of Chinese-English translation by 15.2% and terminology accuracy by 22.7%, with a particularly clear advantage on low-resource languages (e.g., Kiswahili).
This improvement stems from the PPO algorithm's continuous optimization of the translation policy: the model receives immediate reward feedback along multiple dimensions, including fluency, fidelity, and terminology accuracy, and converges on a better translation strategy over repeated training iterations. For example, when translating e-commerce product descriptions, the PPO version preserves specification parameters more reliably (e.g., the '5W-40' motor-oil grade) while handling culture-specific expressions sensibly (e.g., rendering 'best before date' in each market's customary form).
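To make the idea of multi-dimensional reward feedback concrete, here is a minimal sketch of how per-dimension scores could be collapsed into the single scalar reward that a PPO trainer consumes. The weighting scheme and scoring values below are purely illustrative assumptions, not Seed-X's actual reward model.

```python
# Illustrative sketch only: combining multi-dimensional feedback into one
# scalar reward for PPO-style fine-tuning. Weights and scores are hypothetical.

def combined_reward(fluency: float, fidelity: float, terminology: float,
                    weights=(0.3, 0.4, 0.3)) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    w_flu, w_fid, w_term = weights
    return w_flu * fluency + w_fid * fidelity + w_term * terminology

# Example: one candidate translation scored by three (hypothetical) judges.
reward = combined_reward(fluency=0.92, fidelity=0.85, terminology=0.78)
print(f"scalar reward passed to PPO: {reward:.3f}")
```

In practice each dimension would come from its own scorer (a fluency model, a faithfulness metric, a terminology checker), but the principle is the same: the policy is updated toward translations that maximize the combined signal.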
The team recommends prioritizing the PPO version in production environments; its model weights and inference code are available directly on the Hugging Face hub, and it deploys in exactly the same way as the base version.
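Since the weights are distributed through the Hugging Face hub, loading them should follow the standard transformers workflow. The sketch below assumes a repo id of "ByteDance-Seed/Seed-X-PPO-7B" and a plain-text translation prompt; both are assumptions to verify against the model card before use.

```python
# Minimal sketch of loading the PPO weights from the Hugging Face hub with the
# standard transformers API. Repo id and prompt format are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ByteDance-Seed/Seed-X-PPO-7B"  # assumed repo id, check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Translate the following sentence into English: 保质期至2025年12月。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the PPO checkpoint shares the base model's architecture, any deployment stack already serving Seed-X-Instruct-7B should work by swapping the model identifier.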
This answer comes from the article "Seed-X-7B: Efficient Multilingual Translation of Large Models".