Step3 addresses formatting issues through standardized processing:
- Input structure: requires an array of messages in the agreed format, with each content element explicitly specifying a type field (text/image/audio).
- Preprocessing unit: the built-in AutoProcessor automatically recognizes data of different modalities and transforms it into model-acceptable tensors.
Example implementation:
messages = [{
"role": "user",
"content": [
{"type": "image", "image": "https://example.com/img.jpg"},
{"type": "text", "text": "Describe the scene"}
]
}]
The design has been verified to support mixed input of JPEG/PNG images, MP3/WAV audio and UTF-8 text with an error rate below 0.1%.
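To make the input-structure rule concrete, here is a minimal sketch of a pre-flight check that could run before the messages reach the processor. It is not part of Step3: `validate_messages` and `ALLOWED_TYPES` are hypothetical names, and the convention that each element stores its payload under a key named after its type (including an `"audio"` key) is an assumption inferred from the `"image"` and `"text"` elements in the example.

```python
from typing import Any

# Assumption: the three type values named in the text; payload key == type name.
ALLOWED_TYPES = {"text", "image", "audio"}


def validate_messages(messages: Any) -> list[str]:
    """Return a list of problems; an empty list means the structure looks valid."""
    if not isinstance(messages, list):
        return ["messages must be a list"]

    problems: list[str] = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}]: message must be a dict")
            continue
        if not isinstance(message.get("role"), str):
            problems.append(f"messages[{i}]: missing or non-string 'role'")
        content = message.get("content")
        if not isinstance(content, list):
            problems.append(f"messages[{i}]: 'content' must be a list")
            continue
        for j, item in enumerate(content):
            where = f"messages[{i}].content[{j}]"
            if not isinstance(item, dict):
                problems.append(f"{where}: element must be a dict")
                continue
            kind = item.get("type")
            if not isinstance(kind, str) or kind not in ALLOWED_TYPES:
                problems.append(f"{where}: unknown or missing type {kind!r}")
            elif not isinstance(item.get(kind), str):
                # The payload is expected under a key matching the type name.
                problems.append(f"{where}: expected a string under key {kind!r}")
    return problems


if __name__ == "__main__":
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/img.jpg"},
            {"type": "text", "text": "Describe the scene"},
        ],
    }]
    print(validate_messages(messages))
```

Returning a list of problems instead of raising on the first one lets a caller report every malformed element in a single pass, which is convenient when messages are assembled from several sources.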
This answer comes from the article "Step3: Efficient generation of open source big models for multimodal content".