Bandwidth Optimization Strategies
FlashMLA improves memory bandwidth utilization on the H800 along three dimensions:
- Data Layout Optimization:
  - Use the `torch.channels_last` memory format
  - Split the KV cache into 4D tensors of shape `[num_blocks, 64, h_kv, d]`
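The paged layout can be sketched in NumPy. Only the shape `[num_blocks, 64, h_kv, d]` and the `block_table` concept come from the text; the concrete sizes, the sample `block_table` values, and the `gather_token` helper are illustrative assumptions:

```python
import numpy as np

# Hypothetical sizes for illustration; only the layout
# [num_blocks, 64, h_kv, d] is taken from the article.
num_blocks, page_size, h_kv, d = 16, 64, 1, 128

# Paged KV cache: each page holds 64 tokens of h_kv heads of width d.
kv_cache = np.zeros((num_blocks, page_size, h_kv, d), dtype=np.float16)

# A per-sequence block_table maps logical page index -> physical block.
block_table = [3, 7, 0]  # this sequence occupies physical blocks 3, 7, 0

def gather_token(token_idx):
    """Look up the KV vectors for one token position in the paged cache."""
    page, offset = divmod(token_idx, page_size)
    return kv_cache[block_table[page], offset]  # shape: (h_kv, d)

print(gather_token(70).shape)  # token 70 -> page 1, offset 6 -> block 7
```

Fixing the page size at 64 means the inner two dimensions of every block are contiguous, which is what makes coalesced loads possible in the kernel.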
- Access Pattern Control:
  - Plan coalesced memory accesses via `tile_scheduler_metadata`
  - Set `num_splits=4` to balance parallelism and locality
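The idea behind `num_splits` can be illustrated with a NumPy sketch of split-KV attention: the KV sequence is processed in independent chunks, each keeping a running max and partial softmax sums, and the partials are merged at the end. This is the generic flash-decoding-style combine, not FlashMLA's actual kernel code:

```python
import numpy as np

def attention(q, k, v):
    # Reference single-pass attention for one query vector.
    s = k @ q
    p = np.exp(s - s.max())
    return (p[:, None] * v).sum(0) / p.sum()

def split_kv_attention(q, k, v, num_splits=4):
    """Process the KV sequence in num_splits chunks, then merge the
    partial softmax statistics (max, sum, weighted output)."""
    parts = []
    for ks, vs in zip(np.array_split(k, num_splits),
                      np.array_split(v, num_splits)):
        s = ks @ q
        m = s.max()                       # per-split running max
        e = np.exp(s - m)
        parts.append((m, e.sum(), (e[:, None] * vs).sum(0)))
    m_all = max(m for m, _, _ in parts)   # rescale partials to a common max
    l = sum(li * np.exp(m - m_all) for m, li, _ in parts)
    o = sum(oi * np.exp(m - m_all) for m, _, oi in parts)
    return o / l

rng = np.random.default_rng(0)
q = rng.normal(size=128)
k = rng.normal(size=(256, 128))
v = rng.normal(size=(256, 128))
assert np.allclose(split_kv_attention(q, k, v), attention(q, k, v))
```

More splits expose more parallelism across SMs but shrink each chunk's working set; `num_splits=4` is the trade-off the article cites.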
- Prefetch Mechanism:
  - Preload the next likely page referenced in the `block_table`
  - Use `cudaMemAdviseSetPreferredLocation` hints to keep hot pages resident
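The lookahead part of this can be sketched in plain Python: while one page is being consumed, the physical block of the next page is already resolved from the `block_table` so a residency hint can be issued for it. The pairing helper below is ours; the actual hint would be a `cudaMemAdvise` call with `cudaMemAdviseSetPreferredLocation` on that block's address range:

```python
def plan_prefetches(block_table):
    """For each page of a sequence, pair the physical block being read
    with the physical block to prefetch next (None for the last page).
    A real kernel would issue a cudaMemAdviseSetPreferredLocation hint
    for the second element of each pair."""
    return [
        (block_table[i],
         block_table[i + 1] if i + 1 < len(block_table) else None)
        for i in range(len(block_table))
    ]

print(plan_prefetches([3, 7, 0]))  # [(3, 7), (7, 0), (0, None)]
```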
Real-World Parameters
Example configuration to reach 3000 GB/s memory bandwidth on the H800:
- Batch size: ≥ 64
- Head dimension: a multiple of 128 (e.g., 256)
- Parallelism: `CUDA_VISIBLE_DEVICES=0,1,2,3`
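These rules of thumb can be encoded as a small pre-flight check. The thresholds come from the list above; the helper itself and its messages are illustrative, not part of FlashMLA:

```python
def check_bandwidth_config(batch_size, head_dim):
    """Flag configurations unlikely to reach peak bandwidth, per the
    heuristics above (these are guidelines, not hard limits)."""
    issues = []
    if batch_size < 64:
        issues.append("batch size below 64 may under-utilize bandwidth")
    if head_dim % 128 != 0:
        issues.append("head dim not a multiple of 128 hurts coalescing")
    return issues

print(check_bandwidth_config(64, 256))   # []
print(check_bandwidth_config(16, 192))   # two warnings
```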
Monitoring Methods
Run `nvidia-smi dmon -s u` and watch the memory bandwidth utilization column; it should hold steady at 80% or above.
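A quick way to check this automatically is to parse the utilization column out of the `dmon` output. The sample text below is illustrative (column order can vary between driver versions, so a robust script should locate the `mem` column from the header rather than hard-code it):

```python
SAMPLE = """\
# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0    95    83     0     0
    0    96    81     0     0
"""

def mem_utilization(dmon_text):
    """Extract the memory-utilization (%) column from
    `nvidia-smi dmon -s u` output, using the header to find it."""
    lines = dmon_text.splitlines()
    header = next(l for l in lines if l.startswith("# gpu")).lstrip("# ").split()
    col = header.index("mem")
    rows = [l.split() for l in lines if not l.startswith("#")]
    return [int(r[col]) for r in rows]

utils = mem_utilization(SAMPLE)
print(utils, all(u >= 80 for u in utils))  # target: steady >= 80%
```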
This answer is drawn from the article "FlashMLA: Optimizing the MLA Decoding Kernel for Hopper GPUs (DeepSeek Open Source Week Day 1)".