LitServe is an open source AI model serving engine from Lightning AI. It is built on FastAPI and focuses on rapid deployment of inference services for general-purpose AI models. It supports a wide range of models, from large language models (LLMs), vision models, and audio models to classical machine learning models, and provides batching, streaming, and GPU auto-scaling, with a claimed performance boost of at least 2x over plain FastAPI. LitServe is easy to use and highly flexible, and can be self-hosted or fully hosted through Lightning Studios, making it well suited to researchers, developers, and enterprises who want to build efficient model inference APIs quickly. The official materials emphasize enterprise-class features such as security, scalability, and high availability, so that services are ready for production environments out of the box.

Feature List
- Rapid deployment of inference services: Quickly turns models from frameworks such as PyTorch, JAX, and TensorFlow into APIs.
- Batching: Merges multiple inference requests into a batch to improve throughput.
- Streaming: Streams inference results in real time, suitable for continuous-response scenarios.
- GPU auto-scaling: Optimizes performance by dynamically adjusting GPU resources based on inference load.
- Compound AI systems: Allows multiple models to work together in one service to build complex pipelines.
- Self-hosted or cloud-hosted: Supports local deployment or managed deployment through the Lightning Studios cloud.
- vLLM integration: Optimizes inference performance for large language models.
- OpenAPI compatible: Automatically generates standard API documentation for easy testing and integration.
- Full model support: Covers the inference needs of many model types, such as LLM, vision, audio, and embedding models.
- Server optimization: Provides multi-process handling and a claimed inference speedup of more than 2x over FastAPI.
Usage Guide
Installation process
LitServe is easy to install with Python's pip tool. The detailed steps are below:
1. Preparing the environment
Ensure that Python 3.8 or later is installed on your system; a virtual environment is recommended:
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows
2. Installing LitServe
Run the following command to install the stable version:
pip install litserve
If you need the latest features, you can install the development version:
pip install git+https://github.com/Lightning-AI/litserve.git@main
3. Verifying the installation
Verify that the installation succeeded:
python -c "import litserve; print(litserve.__version__)"
If the version number is printed, the installation is complete.
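The same check can be made from a script without importing the package. The sketch below is a minimal example that relies only on the standard library's importlib.metadata; the helper name installed_version is made up for illustration and is not part of LitServe.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("litserve")
    if version is None:
        print("litserve is not installed; run: pip install litserve")
    else:
        print(f"litserve {version} is installed")
```

A None result usually means the package was installed into a different environment than the one running the script.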
4. Optional dependencies
If you need GPU support, install the GPU version of the corresponding framework, for example:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
How to use LitServe
LitServe turns AI models into inference services with concise code. The details are as follows:
1. Creating a simple inference service
The following example is a compound inference service that combines two models:
import litserve as ls

class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # Initialize: load models or data
        self.model1 = lambda x: x ** 2  # squaring model
        self.model2 = lambda x: x ** 3  # cubing model

    def decode_request(self, request):
        # Parse the request payload
        return request["input"]

    def predict(self, x):
        # Compound inference: run both models and combine the results
        squared = self.model1(x)
        cubed = self.model2(x)
        return squared + cubed

    def encode_response(self, output):
        # Format the inference result
        return {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(SimpleLitAPI(), accelerator="auto")
    server.run(port=8000)
- Run: Save the code as server.py and run python server.py.
- Test: Use curl to send an inference request: curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"input": 4.0}'. Output: {"output": 80.0} (16 + 64).
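The curl call above can also be made from Python. The following is a minimal client sketch using only the standard library. It assumes the server from this example is listening on 127.0.0.1:8000, and the helper names build_predict_request, predict_remote, and expected_output are illustrative rather than part of LitServe.

```python
import json
import urllib.request

PREDICT_URL = "http://127.0.0.1:8000/predict"  # assumed local server address


def build_predict_request(value: float, url: str = PREDICT_URL) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = json.dumps({"input": value}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_remote(value: float, timeout: float = 10.0) -> dict:
    """Send the request to a running server and decode the JSON reply."""
    request = build_predict_request(value)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def expected_output(value: float) -> float:
    """Mirror the toy service locally: square plus cube of the input."""
    return value ** 2 + value ** 3


if __name__ == "__main__":
    # Offline preview of the request; call predict_remote(4.0) once the server is up.
    request = build_predict_request(4.0)
    print(request.get_method(), request.full_url, request.data.decode("utf-8"))
    print("expected output:", expected_output(4.0))
```

With the server running, predict_remote(4.0) should return the same JSON object that curl prints.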
2. Enabling batch inference
Modify the server to enable batching:
server = ls.LitServer(SimpleLitAPI(), max_batch_size=4, accelerator="auto")
- Description: max_batch_size=4 means that up to 4 inference requests are merged automatically and processed together as one batch to improve efficiency. With batching enabled, predict is expected to handle a batch of decoded inputs rather than a single value, so the model code may need to be adapted accordingly.
- Testing: Send several requests in quick succession (ideally concurrently) and observe the throughput improvement:
curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"input": 5.0}'
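When batching is enabled, the toy model from step 1 needs to work on several inputs at once. The sketch below shows one way a batch-aware predict could be written, assuming the batch arrives as a plain Python list of decoded inputs. It is written as standalone functions (predict_batch and group_requests are illustrative names) so it can be tried without a running server, and the grouping helper is a simplification that ignores request timing.

```python
from typing import List


def predict_batch(batch: List[float]) -> List[float]:
    """Apply the square-plus-cube toy model to every input in a batch."""
    return [x ** 2 + x ** 3 for x in batch]


def group_requests(inputs: List[float], max_batch_size: int) -> List[List[float]]:
    """Split queued inputs into batches of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        inputs[i:i + max_batch_size]
        for i in range(0, len(inputs), max_batch_size)
    ]


if __name__ == "__main__":
    queued = [1.0, 2.0, 3.0, 4.0, 5.0]
    # With max_batch_size=4, five queued inputs form one full and one partial batch.
    for batch in group_requests(queued, max_batch_size=4):
        print(batch, "->", predict_batch(batch))
```

Returning one output per input, in the same order, is what allows each result to be routed back to the request it came from.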
3. Configuring streaming inference
For real-time inference scenarios:
import litserve as ls

class StreamLitAPI(ls.LitAPI):
    def setup(self, device):
        self.model = lambda x: [x * i for i in range(5)]

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        for result in self.model(x):
            yield result

    def encode_response(self, output):
        # With stream=True, output is the generator returned by predict
        for item in output:
            yield {"output": item}

server = ls.LitServer(StreamLitAPI(), stream=True, accelerator="auto")
server.run(port=8000)
- Description: stream=True enables streaming inference; predict uses yield to return results one by one.
- Testing: Use a client that supports streaming responses:
curl --no-buffer -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"input": 2}'
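The streaming flow can be tried locally without a server. The sketch below chains two plain Python generators in the way a streaming service passes results from a predict step to an encoding step. It illustrates the generator pattern under that assumption and is not LitServe's internal implementation; the function names simply mirror the methods above.

```python
import json
from typing import Dict, Iterator


def predict(x: int) -> Iterator[int]:
    """Yield partial results one at a time, like the toy model above."""
    for i in range(5):
        yield x * i


def encode_response(outputs: Iterator[int]) -> Iterator[Dict[str, int]]:
    """Wrap each partial result as soon as it is produced."""
    for item in outputs:
        yield {"output": item}


if __name__ == "__main__":
    # Each chunk is printed before the next one is computed.
    for chunk in encode_response(predict(2)):
        print(json.dumps(chunk))
```

Because both steps are lazy, nothing is computed until the consumer asks for the next chunk, which is what keeps the latency of the first result low.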
4. GPU auto-scaling
If a GPU is available, LitServe uses it automatically for inference:
- Description: accelerator="auto" detects the available hardware and prefers GPUs.
- Verification: Check the logs after startup to confirm that the GPU is in use.
- Requirements: Make sure the GPU build of your framework (e.g. PyTorch) is installed.
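To confirm outside the server logs whether a GPU is visible, a quick check with PyTorch can help. This is a minimal sketch that assumes PyTorch is the framework in use and falls back to the CPU if it is not installed. detect_device is an illustrative helper and does not reproduce how accelerator="auto" is resolved inside LitServe.

```python
def detect_device() -> str:
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency; may be missing or CPU-only
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Inference device:", detect_device())
```

If this prints cpu on a machine with a GPU, the installed framework build is most likely CPU-only.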
5. Deploying a more complex model (using BERT as an example)
Deploy an inference service for Hugging Face's BERT model:
import torch
from transformers import BertTokenizer, BertModel
import litserve as ls

class BertLitAPI(ls.LitAPI):
    def setup(self, device):
        # Load the tokenizer and model, then move the model to the assigned device
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased").to(device)

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
        # Inference only, so gradient tracking is disabled
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Mean-pool the token embeddings into one vector per input
        return outputs.last_hidden_state.mean(dim=1).tolist()

    def encode_response(self, output):
        return {"embedding": output}

server = ls.LitServer(BertLitAPI(), accelerator="auto")
server.run(port=8000)
- Run: After starting the script, the service is available at http://127.0.0.1:8000/predict.
- Test:
curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"text": "Hello, world!"}'
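The embedding returned by this service is the mean of BERT's per-token hidden states. To make that pooling step concrete, here is a small pure-Python sketch of the same averaging over the token dimension for a single input. mean_pool is an illustrative name, and real code would keep using the tensor operation shown above.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average token vectors element-wise into one sentence vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    num_tokens = len(token_vectors)
    hidden_size = len(token_vectors[0])
    return [
        sum(vector[j] for vector in token_vectors) / num_tokens
        for j in range(hidden_size)
    ]


if __name__ == "__main__":
    # Three tokens with a hidden size of 2 (bert-base-uncased uses 768).
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(mean_pool(tokens))
```

The result has one value per hidden dimension regardless of how many tokens the input text produced, which is why texts of different lengths yield embeddings of the same size.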
6. Integrating vLLM to deploy LLM inference
Efficient inference for large language models:
import litserve as ls
from vllm import LLM, SamplingParams

class LLMLitAPI(ls.LitAPI):
    def setup(self, device):
        self.model = LLM(model="meta-llama/Llama-3.2-1B", dtype="float16")
        # vLLM takes generation settings through a SamplingParams object
        self.sampling_params = SamplingParams(max_tokens=50)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        outputs = self.model.generate(prompt, self.sampling_params)
        return outputs[0].outputs[0].text

    def encode_response(self, output):
        return {"response": output}

server = ls.LitServer(LLMLitAPI(), accelerator="auto")
server.run(port=8000)
- Installing vLLM: pip install vllm
- Test:
curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"prompt": "What is AI?"}'
7. Viewing the API documentation
- Description: Open http://127.0.0.1:8000/docs to test the inference service interactively.
- Note: The page is generated from the OpenAPI standard and lists the details of all endpoints.
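Because the documentation page is generated from an OpenAPI schema, the endpoints can also be listed programmatically. The sketch below assumes the schema is served at /openapi.json, which is FastAPI's default location and may differ if the server is configured otherwise. fetch_schema and list_endpoints are illustrative helpers, and the sample schema in the demo is hand-written.

```python
import json
import urllib.request
from typing import Dict, List, Tuple

SCHEMA_URL = "http://127.0.0.1:8000/openapi.json"  # assumed FastAPI default path
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def fetch_schema(url: str = SCHEMA_URL) -> Dict:
    """Download the OpenAPI schema from a running server."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def list_endpoints(schema: Dict) -> List[Tuple[str, str]]:
    """Return sorted (METHOD, path) pairs described by an OpenAPI schema."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items can also hold non-operation keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


if __name__ == "__main__":
    # Offline demo; use list_endpoints(fetch_schema()) against a live server.
    sample = {"paths": {"/predict": {"post": {}}, "/health": {"get": {}}}}
    for method, path in list_endpoints(sample):
        print(method, path)
```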
8. Hosting options
- Self-hosted: Run the code locally or on your own server.
- Cloud-hosted: Deploy via Lightning Studios; this requires registering an account and offers load balancing, auto-scaling, and more.
Operating Tips
- Debugging: Set timeout=60 to avoid inference timeouts.
- Logging: Check the terminal logs at startup to troubleshoot problems.
- Optimization: Refer to the official documentation to enable advanced features such as authentication and Docker deployment.
Through rapid deployment and optimization of inference services, LitServe covers the full workflow from prototyping to enterprise-class applications.































