How to overcome the tensor dimension error in CSM Voice Cloning when processing long audio?

2025-08-29

Full Process Solution for Long Audio Processing

CSM Voice Cloning reports a tensor dimension error when the input audio is longer than 3 minutes. The options below address it.

Hardware solution
Upgrade your graphics card to an RTX 3060 or higher model with at least 12 GB of video memory, and make sure that:
- The CUDA version is ≥ 11.8
- PyTorch is running with cuDNN acceleration enabled
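The requirements above can be checked from Python before changing any model settings. The following is a minimal sketch, not part of CSM: the helper names `meets_requirements` and `check_local_gpu` are hypothetical, and treating 12 GB as 12 × 1000³ bytes is an assumption made so that a card sold as 12 GB still passes.

```python
def parse_version(text):
    """Turn a version string such as '11.8' into a comparable tuple (11, 8)."""
    return tuple(int(part) for part in text.split(".")[:2])


def meets_requirements(cuda_version, vram_bytes, cudnn_enabled,
                       min_cuda=(11, 8), min_vram_gb=12):
    """Return True if the given values satisfy the article's hardware checklist."""
    if cuda_version is None:  # CPU-only PyTorch builds report no CUDA version
        return False
    return (
        parse_version(cuda_version) >= min_cuda
        and vram_bytes >= min_vram_gb * 1000**3  # assumption: decimal gigabytes
        and bool(cudnn_enabled)
    )


def check_local_gpu():
    """Query PyTorch for the first GPU and apply the checklist to it."""
    import torch  # imported lazily so the pure check above works without PyTorch

    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(0)
    cudnn_ok = torch.backends.cudnn.is_available() and torch.backends.cudnn.enabled
    return meets_requirements(torch.version.cuda, props.total_memory, cudnn_ok)
```

Calling `check_local_gpu()` on the target machine returns False if any item on the checklist is not met.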
Software adjustments
Modify the key parameters:
1. Find the max_seq_len parameter in models.py.
2. Set it to the recommended value for your audio length:
  - 5 minutes of audio: 6144
  - 10 minutes of audio: 12288
3. Make the same change to the corresponding parameter in llama3_2_100M().
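The two recommended values scale linearly with duration (6144 for 300 seconds, 12288 for 600 seconds). For other lengths, a value can be estimated as in the sketch below. This is an extrapolation from those two data points only, not something CSM documents; the helper name `estimate_max_seq_len` and the rounding up to a multiple of 1024 are assumptions.

```python
def estimate_max_seq_len(duration_seconds, multiple=1024):
    """Estimate max_seq_len by scaling the article's 5-minute value linearly.

    Assumption: 6144 positions per 300 seconds, rounded up to a multiple
    of 1024 so the result matches the article's values at 5 and 10 minutes.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    raw = -(-int(duration_seconds) * 6144 // 300)
    return -(-raw // multiple) * multiple


for minutes in (5, 7, 10):
    print(minutes, "min ->", estimate_max_seq_len(minutes * 60))
```

Whatever value is chosen, it has to be set in both places named in the steps above, and larger values need more video memory.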
Alternative
Split the long audio into 180-second segments with ffmpeg and process each segment separately:
ffmpeg -i long.mp3 -f segment -segment_time 180 -c copy out%03d.mp3
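If the split is part of a larger Python pipeline, the same ffmpeg command can be launched with the standard library. This is a minimal sketch that assumes ffmpeg is installed and on the PATH; the function names `build_segment_command` and `split_audio` are hypothetical, and the flags are exactly those of the command above.

```python
import subprocess


def build_segment_command(src, segment_seconds=180, pattern="out%03d.mp3"):
    """Build the ffmpeg argument list for splitting src into fixed-length parts."""
    return [
        "ffmpeg",
        "-i", src,                              # input file
        "-f", "segment",                        # use the segment muxer
        "-segment_time", str(segment_seconds),  # length of each part in seconds
        "-c", "copy",                           # copy the stream, no re-encoding
        pattern,                                # numbered output files
    ]


def split_audio(src, segment_seconds=180, pattern="out%03d.mp3"):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_segment_command(src, segment_seconds, pattern), check=True)
```

For example, `split_audio("long.mp3")` writes out000.mp3, out001.mp3, and so on into the current directory. Because the stream is copied rather than re-encoded, segment lengths may differ slightly from 180 seconds.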