Engineering solutions for lightweight deployment
To address the different deployment needs of the 1B and 3B models:
- Framework selection: supports Transformers native inference as well as the vLLM serving framework (the latter brings a 3-5x throughput increase); loading sketches for both follow this list
- Quantization compression: use `torch.quantization` to compress the 3B model to under 2 GB (see the quantization sketch after this list)
- Layered loading: the speech codec (xcodec2) and the generative model can be deployed on separate devices (a device-placement sketch follows)
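
A minimal loading sketch for both framework paths. The Hugging Face repo id `HKUSTAudio/Llasa-3B` is an assumption here; substitute your actual checkpoint:

```python
# Sketch: two ways to load the Llasa LLM backbone.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HKUSTAudio/Llasa-3B"  # assumed Hugging Face repo id

# Option 1: Transformers native inference
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Option 2: vLLM serving (the source of the 3-5x throughput figure above)
from vllm import LLM, SamplingParams
llm = LLM(model=MODEL_ID, dtype="bfloat16")
outputs = llm.generate(["<text prompt>"], SamplingParams(max_tokens=256))
```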
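For the quantization bullet, a sketch using dynamic int8 quantization of the Linear layers, which is one plausible reading of "torch.quantization"; the final memory footprint depends on which layers are quantized:

```python
# Sketch: dynamic int8 quantization with torch.quantization.
# Linear layers dominate a transformer's parameter count, so storing
# their weights as int8 is what can bring a 3B model under 2 GB.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HKUSTAudio/Llasa-3B")  # assumed repo id
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model,                 # module to quantize
    {torch.nn.Linear},     # layer types to convert to int8
    dtype=torch.qint8,
)
torch.save(quantized.state_dict(), "llasa_3b_int8.pt")
```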
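For layered loading, a sketch placing the codec and the LLM on different devices. The `xcodec2` import path and both repo ids follow the Llasa model cards, but treat them as assumptions:

```python
# Sketch: layered deployment - generative LLM on GPU, speech codec on CPU.
import torch
from transformers import AutoModelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model  # assumed import path

# Assumed repo ids; check the Llasa / xcodec2 model cards.
llm = AutoModelForCausalLM.from_pretrained("HKUSTAudio/Llasa-3B").to("cuda")
codec = XCodec2Model.from_pretrained("HKUSTAudio/xcodec2").to("cpu").eval()
```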
Specific steps: 1) use `model.to('cpu')` to measure baseline performance; 2) enable `torch.jit.trace` to generate an optimized graph (a minimal sketch of these two steps appears below); 3) ONNX Runtime support will arrive with the release of the 8B version.
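
A minimal sketch of steps 1 and 2, assuming the Transformers `torchscript=True` loading flag so the model returns traceable tuple outputs; note that tracing covers a single forward pass, not the full autoregressive generation loop:

```python
# Sketch: CPU baseline timing, then TorchScript tracing of one forward pass.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HKUSTAudio/Llasa-1B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# torchscript=True makes the model return tuples, which trace requires
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torchscript=True)

# Step 1: baseline forward pass on CPU
model.to("cpu").eval()
inputs = tokenizer("hello world", return_tensors="pt")
start = time.perf_counter()
with torch.no_grad():
    model(inputs["input_ids"])
print(f"CPU forward pass: {time.perf_counter() - start:.3f}s")

# Step 2: trace one forward pass into an optimized graph
traced = torch.jit.trace(model, (inputs["input_ids"],))
traced.save("llasa_traced.pt")
```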
This answer comes from the article "Llasa 1~8B: an open-source text-to-speech model for high-quality speech generation and cloning".