MoE Architecture Overview
The Mixture of Experts (MoE) architecture is the neural network design that dots.llm1 uses to balance model performance with computational efficiency.
Architectural Advantages
- Computational efficiency: Although the model has 142 billion parameters in total, only 14 billion are activated per token during inference, greatly reducing computational cost.
- Dynamic routing: For each input token, 6 routed experts are dynamically selected and combined with 2 shared experts, so 8 expert networks are active in total (see the routing sketch after this list).
- Load balancing: Expert utilization is balanced through dynamic bias terms, preventing some experts from being overloaded while others sit idle.
- Performance: Combining the SwiGLU activation function with the multi-head attention mechanism strengthens the model's expressive power.
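The interaction between dynamic routing and bias-based load balancing can be illustrated with a short sketch. The snippet below is a minimal, assumed implementation of a bias-adjusted top-k router in PyTorch: the class name, dimensions, and bias-update rule are illustrative choices, not the dots.llm1 source code.

```python
import torch
import torch.nn as nn

class BiasAdjustedTopKRouter(nn.Module):
    """Illustrative router: picks 6 of 128 routed experts per token and keeps
    a per-expert bias that steers selection toward under-used experts.
    Names, sizes, and the update rule are assumptions for explanation only."""

    def __init__(self, hidden_dim=4096, num_experts=128, top_k=6, bias_lr=1e-3):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        # Dynamic bias used only for expert *selection*, not for output weighting.
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.top_k = top_k
        self.bias_lr = bias_lr

    def forward(self, x):                      # x: [num_tokens, hidden_dim]
        scores = self.gate(x).sigmoid()        # token-to-expert affinity
        # Bias shifts the top-k choice toward underloaded experts.
        _, expert_idx = (scores + self.expert_bias).topk(self.top_k, dim=-1)
        weights = scores.gather(-1, expert_idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize gates

        if self.training:
            # Count how many tokens each expert received in this batch...
            load = torch.zeros_like(self.expert_bias)
            load.scatter_add_(0, expert_idx.flatten(),
                              torch.ones(expert_idx.numel(), device=x.device))
            # ...then nudge the bias: overloaded experts get a lower bias,
            # underloaded experts a higher one.
            self.expert_bias -= self.bias_lr * torch.sign(load - load.mean())

        return expert_idx, weights             # selected experts and mixing weights
```

Because the bias only affects which experts are chosen, the gating weights that mix expert outputs still come from the raw affinity scores, so load balancing does not distort the forward computation.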
Technical Details
The model adopts a decoder-only Transformer architecture, replacing the traditional feed-forward network with an MoE structure containing 128 routed experts and 2 shared experts. The attention layers use multi-head attention combined with RMSNorm normalization, which preserves strong expressive power while improving numerical stability.
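To make the layer structure concrete, here is a small sketch of an MoE feed-forward layer with SwiGLU experts, reusing the BiasAdjustedTopKRouter sketched above. Expert dimensions, class names, and the dense dispatch loop are assumptions for readability, not the optimized dots.llm1 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert FFN using a SwiGLU activation (dimensions are illustrative)."""
    def __init__(self, hidden_dim=4096, ffn_dim=1408):
        super().__init__()
        self.w_gate = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.w_up = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.w_down = nn.Linear(ffn_dim, hidden_dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(x W_gate) * (x W_up), projected back to hidden_dim.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class MoELayer(nn.Module):
    """MoE feed-forward layer: 128 routed experts (6 active per token)
    plus 2 shared experts that every token passes through."""
    def __init__(self, hidden_dim=4096, num_experts=128, num_shared=2, top_k=6):
        super().__init__()
        self.router = BiasAdjustedTopKRouter(hidden_dim, num_experts, top_k)
        self.experts = nn.ModuleList(SwiGLUExpert(hidden_dim) for _ in range(num_experts))
        self.shared = nn.ModuleList(SwiGLUExpert(hidden_dim) for _ in range(num_shared))

    def forward(self, x):                      # x: [num_tokens, hidden_dim]
        expert_idx, weights = self.router(x)
        out = sum(e(x) for e in self.shared)   # shared experts see every token
        # Dense loop over experts for clarity; real systems dispatch tokens sparsely.
        for slot in range(expert_idx.shape[-1]):
            for e in range(len(self.experts)):
                mask = expert_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

In a full decoder block, this MoE layer would sit after the RMSNorm-preceded multi-head attention sublayer, in place of the conventional dense feed-forward network.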
This answer comes from the article "dots.llm1: the first MoE large language model open-sourced by Little Red Book".