Operating procedure for the Referring Expression Comprehension (REC) task
VLM-R1 is particularly strong at the Referring Expression Comprehension (REC) task. Here is how to use it:
Training phase
- Download the required datasets: including the COCO Train2014 image dataset and the RefCOCO annotation file
- Configure training parameters: modify the training script in the src/open-r1-multimodal directory
- Start training: launch a multi-GPU run with torchrun, e.g. torchrun --nproc_per_node=8 ... (see the sketch after this list)
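A minimal launch sketch is below. The entry-point path and every flag name are illustrative assumptions, not confirmed from the repo; consult the training scripts shipped in src/open-r1-multimodal for the actual names.

```bash
# Minimal multi-GPU launch sketch. The entry point (grpo_rec.py) and the
# flag names are assumptions for illustration; check the scripts in
# src/open-r1-multimodal for the real arguments.
cd src/open-r1-multimodal
torchrun --nproc_per_node=8 \
    src/open_r1/grpo_rec.py \
    --model_name_or_path /path/to/base-vlm \
    --dataset_config data_config/rec.yaml \
    --image_root /path/to/coco/train2014 \
    --output_dir ./checkpoints/vlm-r1-rec
```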
Inference stage
- Go to the eval directory: cd src/eval
- Run the test script: python test_rec_r1.py --model_path ...
- Provide input: supply an image and a natural-language question such as "Where is the blue car in the picture?" (a sample invocation is sketched below)
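Putting these steps together, a run might look like the sketch below. Only --model_path appears in the steps above; the other flags and the output shown are assumptions, so check test_rec_r1.py for the arguments it actually accepts.

```bash
# Sample evaluation run. The --image_path and --question flags and the
# output format are illustrative assumptions.
cd src/eval
python test_rec_r1.py \
    --model_path ./checkpoints/vlm-r1-rec \
    --image_path ./images/street.jpg \
    --question "Where is the blue car in the picture?"
# Possible output: a bounding box in [x1, y1, x2, y2] pixel coordinates, e.g.
#   [132, 405, 318, 560]
```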
Input/output example
- Input: an image containing multiple objects plus a natural-language query (e.g. "find the red cup in the bottom right corner of the picture")
- Output: the bounding-box coordinates or a positional description of the target object
Note
For custom data, you can modify the data_config/rec.yaml configuration file to add your own image paths and annotation files (a sketch is shown below).
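For instance, a custom config might be written as in the sketch below. The field names (datasets, json_path, image_root, sampling_strategy) are assumptions about the yaml schema; mirror the structure of the shipped data_config/rec.yaml rather than copying this verbatim.

```bash
# Sketch of a custom data config written via a heredoc. The field names are
# assumptions; compare with the shipped data_config/rec.yaml before use.
cat > data_config/my_rec.yaml <<'EOF'
datasets:
  - json_path: /path/to/my_annotations.json   # RefCOCO-style annotation file
    image_root: /path/to/my_images            # directory with your images
    sampling_strategy: all
EOF
```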
This answer is based on the article "VLM-R1: A Visual Language Model for Localizing Image Targets through Natural Language".