The model integrates two acceleration techniques: speculative Jacobi decoding and model quantization. Speculative Jacobi decoding reduces the number of sequential generation steps by predicting multiple tokens in parallel, while quantization compresses model parameters to 8-bit precision. In benchmarks, generating a 768×768 image on an A100 GPU takes 694 seconds in standard mode but only 304 seconds with both accelerations enabled, a substantial speedup. Memory usage likewise drops from 80GB to 33.8GB, allowing consumer GPUs such as the RTX 4090 to run high-resolution generation. This combination preserves generation quality (SSIM > 0.92) while dramatically lowering the barrier to use, a notable feat of engineering.
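To make the parallel-decoding idea concrete, here is a minimal Python sketch of a Jacobi-style speculative decoding loop. It is an illustration only, not Lumina-mGPT-2.0's actual implementation: `next_token`, `jacobi_decode`, the window size, and the toy deterministic "model" are all hypothetical stand-ins.

```python
import random

def next_token(prefix):
    # Toy stand-in for the model's greedy prediction at one position.
    # A real implementation would run the transformer once over the whole
    # draft window and read out every position from that single pass.
    random.seed(hash(tuple(prefix)))
    return random.randrange(16)

def jacobi_decode(prompt, n_tokens, window=8):
    """Refine a window of draft tokens in parallel; accept the longest
    prefix that has reached a fixed point, so each iteration verifies
    at least one token and often several."""
    seq = list(prompt)
    draft = [0] * window                       # arbitrary initial guesses
    while len(seq) - len(prompt) < n_tokens:
        # Jacobi step: every position is re-predicted from the *previous*
        # iterate; in a real model these updates share one forward pass.
        new = [next_token(seq + draft[:i]) for i in range(window)]
        k = 0                                  # first mismatch index
        while k < window and new[k] == draft[k]:
            k += 1
        # new[k] is also verified: it conditions only on accepted tokens.
        seq += new[:k + 1]
        draft = (new[k + 1:] + [0] * (k + 1))[:window]  # shift the tail
    return seq[len(prompt):len(prompt) + n_tokens]

print(jacobi_decode([1, 2, 3], 12))
```

Similarly, a minimal sketch of the kind of 8-bit weight compression described above, using symmetric per-tensor int8 quantization in NumPy. The per-tensor granularity and function names are assumptions for illustration, not details taken from the article.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: weights become int8 plus
    one float scale, roughly halving bf16 storage (quartering fp32)."""
    scale = max(float(np.abs(w).max()), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize_int8(q, s)).max())
```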
This answer comes from the article "Lumina-mGPT-2.0: An Autoregressive Image Generation Model for Handling Multiple Image Generation Tasks".