Countdown task training is divided into two phases, data preprocessing and model training, which are described below:
Phase I: Data preparation
Execute the command: `python ./examples/data_preprocess/countdown.py --local_dir {dataset_path}`
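For example, assuming the TinyZero repository root as the working directory, a concrete invocation might look like the following; the output path is a placeholder, not a location required by the script:

```bash
# Run the Countdown preprocessing script; ./data/countdown is a placeholder output directory
python ./examples/data_preprocess/countdown.py --local_dir ./data/countdown
```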
The script will automatically:
- Generate training data in the format expected by the Qwen models
- Build a prompt template specific to the numerical reasoning task
- Split the data into training and validation sets (default ratio 8:2)
Phase II: Launching training
The following environment variables need to be configured:
- BASE_MODEL: path to the base model (e.g. Qwen-1.5B)
- DATA_DIR: directory containing the preprocessed data
- EXPERIMENT_NAME: experiment identifier (used for wandb logging)
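As a sketch, a typical shell setup might look like this; all three values are placeholders to adapt to your environment:

```bash
# Placeholder values -- substitute your own model path, data directory, and run name
export BASE_MODEL=./models/Qwen-1.5B        # base model weights (placeholder path)
export DATA_DIR=./data/countdown            # directory produced by the preprocessing step
export EXPERIMENT_NAME=countdown-qwen-1.5b  # identifier used for the wandb run
```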
Finally, execute `bash ./scripts/train_tiny_zero.sh` to start training. The system will automatically:
- Load the veRL policy and value networks
- Run Monte Carlo Tree Search (MCTS) for policy optimization
- Report validation-set accuracy every 100 steps
Typical training time: a 1.5B model takes about 3.5 hours on a single H200 to reach 90%+ validation accuracy.
This answer comes from the article *TinyZero: A Low-Cost Replication of DeepSeek-R1 Zero's Epiphany Effect*.