Memory management scheme for large model training
For models with 3B+ parameters, the following strategy is recommended:
- Distributed training: set `N_GPUS=2` so both cards run in parallel with synchronized updates, and set `ROLLOUT_TP_SIZE=2` to match the number of GPUs.
- Instruct optimization: use the Qwen2.5-3B-Instruct model with the `--template_type=qwen-instruct` parameter to strengthen instruction-following ability.
- GPU memory optimization: add the `--no-build-isolation` flag when installing flash-attn to ensure compatibility.
- Batch control: adjust the batch sizes in `train_tiny_zero.sh` to balance memory footprint against training speed (a combined launch sketch follows this list).
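As a concrete illustration, a minimal launch sketch combining these settings is shown below. `N_GPUS`, `ROLLOUT_TP_SIZE`, the flash-attn flag, and `train_tiny_zero.sh` come from the recommendations above; the `BASE_MODEL` variable and the script's path are assumptions and may differ in your checkout.

```bash
# Install flash-attn without build isolation so it builds against the
# torch version already present in the environment.
pip install flash-attn --no-build-isolation

# Two-GPU distributed setup: the rollout tensor-parallel size matches the GPU count.
export N_GPUS=2
export ROLLOUT_TP_SIZE=2
export BASE_MODEL=Qwen/Qwen2.5-3B-Instruct   # assumed Hugging Face model id

# Launch training; batch sizes are tuned inside the script to balance
# memory footprint against throughput.
bash ./scripts/train_tiny_zero.sh
```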
For the experiment name, include the model-scale information, e.g. `countdown-qwen2.5-3b-instruct`. Before training, be sure you have activated the environment with `conda activate zero` and correctly set the `DATA_DIR` dataset path variable (see the short sketch below).
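For that pre-flight check, a short sketch (the `EXPERIMENT_NAME` variable is an assumed way to pass the run name; adjust it to whatever your copy of `train_tiny_zero.sh` actually reads):

```bash
# Activate the project environment and point DATA_DIR at the dataset.
conda activate zero
export DATA_DIR=/path/to/countdown-dataset             # replace with your actual dataset path
export EXPERIMENT_NAME=countdown-qwen2.5-3b-instruct   # run name encodes model scale
```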
This answer comes from the article "TinyZero: A Low-Cost Replication of DeepSeek-R1 Zero's Epiphany Effect".