Deployment guide for low-resource environments
For GPUs with less than 8 GB of memory, or CPU-only environments, the following tiered optimization strategy is available:
- Model selection: the OpenMed-NER-TinyMed series (65M parameters) is designed for low-resource settings, with a memory footprint of only about 15% of the standard model.
- Half-precision loading: pass the `torch_dtype=torch.float16` argument when loading the model to enable half precision and cut GPU memory usage by roughly 50%. Sample code:
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)
- Batch control: pass `batch_size=2~4` when calling the pipeline (the GPU is selected with `device=0` when the pipeline is constructed, not per call):
ner_pipeline(texts, batch_size=4)
- CPU-only path: convert the model to ONNX format and install the onnxruntime acceleration library, which can speed up inference by up to 3x:
pip install optimum[onnxruntime]
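Taken together, the GPU-side steps above can be sketched as below. This is a minimal sketch, not the platform's official code: the model ID is a placeholder for an actual OpenMed-NER-TinyMed checkpoint, and the small helper just illustrates the arithmetic behind the ~50% saving from float16.

```python
def model_memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Approximate weight memory in GB: parameter count x bytes per parameter."""
    return n_params * bytes_per_param / 1e9

# float16 stores 2 bytes per parameter instead of float32's 4, hence ~50% savings:
# model_memory_gb(434_000_000, 4) -> ~1.74 GB; model_memory_gb(434_000_000, 2) -> ~0.87 GB

def build_ner_pipeline(model_name: str, device: int = 0):
    """Load a checkpoint in half precision and wrap it in a NER pipeline."""
    import torch
    from transformers import pipeline  # imported lazily so the sketch stays importable
    return pipeline(
        "token-classification",
        model=model_name,
        torch_dtype=torch.float16,  # half precision: roughly halves GPU memory
        device=device,              # device is fixed at construction time
    )

# Usage (downloads the checkpoint; the model ID below is a placeholder):
# ner = build_ner_pipeline("OpenMed-NER-TinyMed")
# results = ner(texts, batch_size=4)  # small batches keep peak memory low
```

For the CPU-only path, the `optimum[onnxruntime]` extra also installs `optimum-cli`, whose `optimum-cli export onnx --model <model_id> <output_dir>` command performs the ONNX conversion.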
Real-world testing shows that, when running a 434M-parameter model on an NVIDIA T4 (16 GB), throughput rises from 12 to 58 entries/second with half precision plus a batch size of 8. Out-of-memory warnings can be resolved by setting the `max_memory` parameter to assign a tiered cache across devices.
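As a sketch of the `max_memory` mechanism: when a model is loaded with `device_map="auto"` (via the accelerate library), a per-device memory budget can be supplied, and layers that exceed the GPU budget spill over to CPU RAM. The 6GiB/12GiB figures below are illustrative examples, not recommended values.

```python
# Illustrative per-device memory budget; the values are examples, not recommendations.
max_memory = {0: "6GiB", "cpu": "12GiB"}  # cap GPU 0 at 6 GiB; overflow goes to CPU RAM

# Hypothetical usage (requires accelerate; the model ID is a placeholder):
# from transformers import AutoModelForTokenClassification
# model = AutoModelForTokenClassification.from_pretrained(
#     "OpenMed-NER-TinyMed",
#     device_map="auto",       # let accelerate place layers across devices
#     max_memory=max_memory,   # spill layers that do not fit the GPU budget to CPU
# )
```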
This answer comes from the article "OpenMed: an open source platform for free AI models in healthcare".