vllm-cli provides four built-in configuration profiles, each tuned for a different usage scenario:
- `standard`: Default configuration with sensible parameters recommended by vLLM, suitable for most models and general usage scenarios
- `moe_optimized`: Optimized for Mixture-of-Experts (MoE) models, with tuned parameters for expert selection and routing
- `high_throughput`: Configuration that maximizes request throughput, for scenarios requiring high-frequency invocation of the model
- `low_memory`: Memory-optimized configuration that automatically enables techniques such as FP8 quantization, for hardware environments with limited GPU memory
These predefined profiles can be selected quickly via the `--profile` parameter. In practical development, it is recommended to start with the `standard` profile, then switch to another optimization profile or make custom parameter adjustments according to specific needs.
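As an illustration, an invocation might look like the following sketch. The `--profile` flag and the profile names come from the description above; the `serve` subcommand and the model name are assumptions made for this example, so check `vllm-cli --help` for the exact syntax of your installed version:

```bash
# Start with the default profile first (the recommended workflow).
# "serve" and the model identifier are illustrative assumptions.
vllm-cli serve Qwen/Qwen2.5-7B-Instruct --profile standard

# On GPUs with limited memory, switch to the memory-optimized profile,
# which automatically enables techniques such as FP8 quantization.
vllm-cli serve Qwen/Qwen2.5-7B-Instruct --profile low_memory
```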
This answer is based on the article "vLLM CLI: Command Line Tool for Deploying Large Language Models with vLLM".