Production-grade deployment solutions
The following two options are recommended for highly available deployments:
- vLLM server:
  - Install the specialized build: `uv pip install --pre vllm==0.10.1+gptoss`
  - Start the API service: `vllm serve openai/gpt-oss-120b --tensor-parallel-size 4`
  - Configure an Nginx reverse proxy and a `pm2` process guard
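A minimal Nginx reverse-proxy block for the setup above might look like the following (a sketch only: the upstream port assumes vLLM's default of 8000, and the hostname is a placeholder):

```nginx
upstream vllm_backend {
    # vLLM listens on port 8000 by default; add more servers for HA
    server 127.0.0.1:8000;
}

server {
    listen 80;
    server_name gpt-oss.example.com;  # placeholder hostname

    location /v1/ {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;  # generous timeout for long generations
    }
}
```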
- Kubernetes:
  - Build a Docker image (refer to `Dockerfile.gpu` in the repository)
  - Declare GPU requirements by setting `resources.limits.nvidia.com/gpu: 2`
  - Scale capacity up and down automatically via a `HorizontalPodAutoscaler`
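The GPU declaration and autoscaler from the list above could be sketched roughly as follows (all names and the image tag are hypothetical, and the CPU-based scaling metric is an assumption; adjust to your cluster):

```yaml
# Deployment fragment with the GPU limit from the text
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpt-oss-vllm            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels: {app: gpt-oss-vllm}
  template:
    metadata:
      labels: {app: gpt-oss-vllm}
    spec:
      containers:
      - name: vllm
        image: registry.example.com/gpt-oss:latest  # built from Dockerfile.gpu
        resources:
          limits:
            nvidia.com/gpu: 2   # GPU declaration from the text
---
# HorizontalPodAutoscaler fragment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpt-oss-vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpt-oss-vllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```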
Key optimization points include:
1. Enable `--quantization=mxfp4` to cut GPU memory usage by roughly 50%
2. Set `--max-num-seqs=128` to improve concurrent request handling
3. For monitoring, use the vLLM Prometheus exporter to collect QPS and latency metrics
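As a rough back-of-the-envelope check on the memory claim (a sketch only: the exact savings depend on which layers vLLM quantizes and on KV-cache settings, and 4.25 bits/weight assumes MXFP4's 4-bit values plus an 8-bit shared scale per 32-element block):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB for a model with n_params parameters."""
    return n_params * bits_per_param / 8 / 1e9

# 120B parameters at bf16 vs. MXFP4 (illustrative arithmetic, not measured)
bf16_gb = weight_memory_gb(120e9, 16)     # ~240 GB
mxfp4_gb = weight_memory_gb(120e9, 4.25)  # ~64 GB
```

This covers weights only; activations and KV cache scale separately with `--max-num-seqs` and sequence length.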
This answer comes from the article "Collection of scripts and tutorials for fine-tuning OpenAI GPT OSS models".