Bandwidth Optimization Strategies
FlashMLA improves H800 memory bandwidth utilization along three dimensions:
- Data Layout Optimization:
  - Use the torch.channels_last memory format
  - Split the KV cache into a 4D tensor of shape [num_blocks, 64, h_kv, d] (see the sketch after this list)
- Access Pattern Control:
  - Plan coalesced memory accesses via tile_scheduler_metadata
  - Set num_splits=4 to balance parallelism and locality
- Prefetch Mechanism:
  - Preload the next likely page in the block_table
  - Use cudaMemAdviseSetPreferredLocation hints (see the prefetch sketch below)
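To make the first two points concrete, here is a minimal decode-step sketch assuming the get_mla_metadata / flash_mla_with_kvcache interface from the FlashMLA repository; the shapes (h_kv = 1, d = 576, dv = 512, 4096 cached tokens) are illustrative assumptions rather than values taken from this article:

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Illustrative decode shapes; h_kv = 1, d = 576, dv = 512 follow
# DeepSeek-style MLA and are assumptions of this sketch.
b, s_q, h_q, h_kv = 64, 1, 128, 1
d, dv = 576, 512
block_size = 64
seqlen = 4096
blocks_per_seq = seqlen // block_size
num_blocks = b * blocks_per_seq

# KV cache stored as the 4D paged tensor [num_blocks, 64, h_kv, d].
kvcache = torch.randn(num_blocks, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(num_blocks, dtype=torch.int32,
                           device="cuda").view(b, blocks_per_seq)
cache_seqlens = torch.full((b,), seqlen, dtype=torch.int32, device="cuda")
q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")

# tile_scheduler_metadata encodes the planned, coalesced access schedule;
# num_splits controls how each sequence is split across SMs
# (parallelism vs. locality).
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

o, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True)
```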
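The prefetch idea can be sketched independently of the kernel. The helper below is hypothetical (not part of the FlashMLA API): it stages the page that block_table points to next from a pinned host pool onto the GPU on a side stream; for managed-memory pools, cudaMemAdviseSetPreferredLocation hints would express the same intent and are only mentioned in a comment:

```python
import torch

# Hypothetical prefetch helper (not part of the FlashMLA API): copy the page
# that block_table says is needed next from a pinned host pool onto the GPU
# on a dedicated stream, so the decode kernel never stalls waiting for it.
# For managed-memory pools, cudaMemAdviseSetPreferredLocation would express
# the same intent by pinning a page's preferred residence to the GPU.
prefetch_stream = torch.cuda.Stream()

def prefetch_next_page(host_pool, gpu_pool, block_table, seq_idx, next_page_idx):
    """Stage page block_table[seq_idx, next_page_idx] onto the GPU."""
    block_id = int(block_table[seq_idx, next_page_idx])
    with torch.cuda.stream(prefetch_stream):
        gpu_pool[block_id].copy_(host_pool[block_id], non_blocking=True)

# Example usage with a tiny pool of [block_size, h_kv, d] pages.
num_blocks, block_size, h_kv, d = 32, 64, 1, 576
host_pool = torch.randn(num_blocks, block_size, h_kv, d,
                        dtype=torch.bfloat16).pin_memory()
gpu_pool = torch.empty_like(host_pool, device="cuda")
block_table = torch.arange(num_blocks, dtype=torch.int32).view(1, num_blocks)

prefetch_next_page(host_pool, gpu_pool, block_table, seq_idx=0, next_page_idx=1)
torch.cuda.current_stream().wait_stream(prefetch_stream)  # sync before use
```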
Real-World Parameters
Example configuration to achieve 3000 GB/s bandwidth on H800:
- Batch size: ≥64
- Head dimension: multiples of 128 (e.g., 256)
- Parallelism: CUDA_VISIBLE_DEVICES=0,1,2,3
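Expressed in code, the configuration above might look like the following; the variable names are mine, and the environment variable must be set before any CUDA context is created:

```python
import os

# Select four GPUs before any CUDA context is created (assumption: a
# single-node setup where each visible GPU serves its own decode batches).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

batch_size = 64    # per the article: batch size >= 64
head_dim = 256     # per the article: a multiple of 128, e.g. 256
```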
Monitoring Methods
Run nvidia-smi dmon -s u and observe the memory bandwidth utilization; the target value should hold steady at 80% or above.
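For automated checks, a small polling helper can wrap nvidia-smi's query interface instead of the interactive dmon stream; the helper below is an assumption of convenience, not something from the article:

```python
import subprocess
import time

def memory_bandwidth_util(gpu_index=0):
    """Return the memory-controller utilization (%) reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=utilization.memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

# Poll once per second and flag samples below the 80% target.
for _ in range(10):
    util = memory_bandwidth_util(0)
    status = "OK" if util >= 80 else "below target"
    print(f"memory utilization: {util}% ({status})")
    time.sleep(1)
```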
This answer comes from the article "FlashMLA: Optimizing the MLA Decoding Kernel for Hopper GPUs (DeepSeek Open Source Week Day 1)".