MoBA handles block sparse attention efficiently through the following mechanism:
Chunking stage:
- Divide the input sequence into N fixed-size context blocks
- Compute an initial affinity score for each query token against every KV block (a sketch follows below)
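As a rough illustration (not the authors' implementation), the per-query block scores can be obtained MoBA-style by mean-pooling the keys of each block and taking an inner product with the query. The function name `block_scores` and the tensor shapes are assumptions for this sketch.

```python
import torch

def block_scores(q, k, block_size):
    """Score each KV block for a batch of query vectors.

    q: (num_q, d)    query vectors
    k: (seq_len, d)  key vectors, seq_len assumed divisible by block_size
    returns: (num_q, num_blocks) affinity scores
    """
    num_blocks = k.shape[0] // block_size
    # Representative vector per block: the mean of its keys.
    block_means = k.view(num_blocks, block_size, -1).mean(dim=1)  # (num_blocks, d)
    # Affinity of every query to every block via inner product.
    return q @ block_means.T                                      # (num_q, num_blocks)
```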
Attention allocation stage:
- Select the k highest-scoring blocks via parameter-free top-k gating
- Perform fine-grained attention computation only within the selected blocks
- Unselected blocks receive zero gate weight, so no computation is spent on them (see the sketch after this list)
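Continuing the sketch under the same assumptions, top-k gating picks the highest-scoring blocks per query and attention is restricted to the tokens of those blocks. This simplified, non-causal version uses a dense mask to show the semantics only; a real implementation gathers just the selected blocks (e.g., with block-sparse FlashAttention kernels) to realize the actual speedup.

```python
import torch
import torch.nn.functional as F

def moba_like_attention(q, k, v, block_size, top_k):
    """Sparse attention over the top-k scoring KV blocks per query (simplified, non-causal sketch)."""
    num_q, d = q.shape
    num_blocks = k.shape[0] // block_size
    scores = block_scores(q, k, block_size)                        # (num_q, num_blocks)
    chosen = scores.topk(min(top_k, num_blocks), dim=-1).indices   # (num_q, top_k)

    # Token-level mask: True where a query is allowed to attend.
    block_mask = torch.zeros(num_q, num_blocks, dtype=torch.bool)
    block_mask.scatter_(1, chosen, True)
    token_mask = block_mask.repeat_interleave(block_size, dim=1)   # (num_q, seq_len)

    attn = (q @ k.T) / d ** 0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))            # unselected blocks get zero weight
    return F.softmax(attn, dim=-1) @ v
```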
Dynamic adjustment mechanisms:
- Each query token independently decides which combination of blocks to attend to
- Supports seamless switching between full attention (k = number of blocks) and sparse attention
- Block size and k are tunable to hardware constraints and task requirements (illustrated below)
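For example, setting `top_k` equal to the number of blocks selects every block, so the sketch above degenerates to ordinary full attention, while smaller `top_k` or larger `block_size` trades coverage against compute. The quick check below assumes the two helper functions (and imports) from the previous sketches.

```python
torch.manual_seed(0)
q, k, v = torch.randn(4, 64), torch.randn(512, 64), torch.randn(512, 64)

sparse = moba_like_attention(q, k, v, block_size=64, top_k=2)   # attend to 2 of 8 blocks
full   = moba_like_attention(q, k, v, block_size=64, top_k=8)   # k = all blocks -> full attention
dense  = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(full, dense, atol=1e-6))                   # True: the sparse path recovers full attention
```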
This hierarchical, selective attention design lets the model cut the attention computation substantially while ensuring that the blocks carrying critical information are not dropped.
This answer is based on the MoBA paper from Kimi (Moonshot AI), "MoBA: Mixture of Block Attention for Long-Context LLMs".