Solutions to Improve Math Reasoning Speed
The MiMo-7B-RL model addresses slow reasoning on mathematical competition questions through three technical optimizations:
- Multiple Token Prediction (MTP): setting the num_speculative_tokens=1 parameter in the vLLM inference engine lets the model speculatively predict additional tokens, reaching an acceptance rate of 90%. Empirical tests show that this reduces inference latency by 30% (see the sketch after this list).
- Reinforcement learning optimization: the RL version of the model, trained on a dataset of 130,000 math problems, runs inference 2.3 times faster on AIME competition questions than the base version, so the MiMo-7B-RL model should be used by preference.
- Seamless Rollout Engine: although it mainly operates during the training phase, the resulting model optimizations reduce single-inference time by 19%, which is particularly useful in scenarios where many questions are answered consecutively.
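The MTP setting can also be used for offline inference through the vLLM Python API. Below is a minimal sketch, assuming the Xiaomi vLLM fork accepts num_speculative_tokens as an engine argument (the parameter name is taken from the server command further down; the prompt and max_tokens value are illustrative):

```python
from vllm import LLM, SamplingParams

# Load MiMo-7B-RL with MTP-style speculative decoding enabled
# (num_speculative_tokens=1 drafts one extra token per decoding step).
llm = LLM(
    model="XiaomiMiMo/MiMo-7B-RL",
    trust_remote_code=True,
    num_speculative_tokens=1,
)

# temperature=0.6 is the speed/accuracy balance recommended below.
sampling_params = SamplingParams(temperature=0.6, max_tokens=4096)

outputs = llm.generate(
    ["Find the remainder when 2^100 is divided by 7."],  # illustrative math prompt
    sampling_params,
)
print(outputs[0].outputs[0].text)
```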
Specific steps:
- Install Xiaomi's customized vLLM fork:
  pip install "vllm @ git+https://github.com/XiaomiMiMo/vllm.git@feat_mimo_mtp_stable_073"
- Add the MTP parameter when starting the service:
  python3 -m vllm.entrypoints.api_server --model XiaomiMiMo/MiMo-7B-RL --host 0.0.0.0 --trust-remote-code --num_speculative_tokens 1
- Set temperature=0.6 to keep a balance between speed and accuracy (a request example follows this list).
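Once the service is running, it can be queried over HTTP. The following is a minimal sketch assuming the default /generate endpoint of vllm.entrypoints.api_server on port 8000; the host, port, prompt, and max_tokens value are illustrative:

```python
import requests

# Sampling parameters are passed alongside the prompt; temperature=0.6
# follows the recommendation above.
payload = {
    "prompt": "Find all positive integer solutions of x^2 - y^2 = 2024.",
    "temperature": 0.6,
    "max_tokens": 2048,
}

resp = requests.post("http://localhost:8000/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["text"][0])  # the response contains the prompt plus the generated text
```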
Note: for AIME 2024 questions, an empty system prompt is recommended for optimal performance. The recommended hardware is at least an NVIDIA A100 40GB GPU.
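One way to apply the empty-system-prompt recommendation is to build the prompt string with the model's chat template before sending it to the service. This is a sketch assuming the Hugging Face tokenizer for MiMo-7B-RL ships a chat template that accepts an empty system message; the question text is a placeholder:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("XiaomiMiMo/MiMo-7B-RL", trust_remote_code=True)

messages = [
    {"role": "system", "content": ""},                       # empty system prompt, as recommended
    {"role": "user", "content": "AIME 2024 Problem: ..."},   # placeholder question
]

# Produce the formatted prompt string; it can be sent as the "prompt" field above.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```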
This answer comes from the article "MiMo: A Small Open Source Model for Efficient Mathematical Reasoning and Code Generation".