An end-to-end workflow for multimodal input processing
For multimodal input scenarios such as image + text, AIRouter provides a standardized processing flow:
1. Data pre-processing
- Convert images to Base64 encoding (a resolution of no more than 1024 px is recommended)
- Include a clear processing instruction in the text prompt (e.g., "Describe the content of the image")
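The pre-processing step above can be sketched as follows. This is a minimal helper, not part of AIRouter itself; it assumes the Pillow library (which the exception-handling notes below name as a dependency) and the 1024 px recommendation from this article:

```python
import base64
import io

from PIL import Image


def image_to_base64(path: str, max_side: int = 1024) -> str:
    """Shrink an image so its longest side is at most max_side,
    then return its Base64-encoded PNG bytes."""
    img = Image.open(path)
    if max(img.size) > max_side:
        img.thumbnail((max_side, max_side))  # preserves aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

The resulting string can be passed directly as the `img_base64` argument in the call shown in step 2.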
2. Model calls
Use the generate_mm method and specify a model that supports multimodal input (gpt4o_mini is currently recommended):
response = LLM_Wrapper.generate_mm(
    model_name="gpt4o_mini",
    prompt="Describe image",
    img_base64=your_base64_string
)
3. Exception handling
- Check the logs for errors of type MultimodalError
- For Docker deployments, make sure image-processing dependencies such as Pillow are installed
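One way to make those MultimodalError log entries easy to find is to wrap the call in a small logging shim. This is a sketch only: the article does not document AIRouter's exception classes, so the wrapper catches any exception and records its message before re-raising:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("airouter.mm")


def call_with_logging(call, *args, **kwargs):
    """Run a model call; on failure, log the error message
    (so MultimodalError entries show up in the logs) and re-raise."""
    try:
        return call(*args, **kwargs)
    except Exception as exc:  # exact AIRouter exception types are an assumption
        log.error("multimodal call failed: %s", exc)
        raise
```

Usage would look like `call_with_logging(LLM_Wrapper.generate_mm, model_name="gpt4o_mini", prompt="Describe image", img_base64=your_base64_string)`.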
Extended suggestion: for specialized fields such as medical imaging, preprocess images with professional annotation tools before passing them in.
This answer comes from the article "AIRouter: Intelligent Routing Tool for Calling Multiple Models with Unified API Interface".