Operating procedure for the Referring Expression Comprehension (REC) task
VLM-R1 is particularly strong at the Referring Expression Comprehension (REC) task. Here is how to use it:
Training phase
- Download the required datasets: including the COCO Train2014 image dataset and the RefCOCO annotation file
- Configure training parameters: modify the training script in the src/open-r1-multimodal directory
- Start training: launch a multi-GPU run with torchrun, e.g. torchrun --nproc_per_node=8 ... (see the sketch after this list)
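A minimal launch sketch is below. The entry-point path and every flag name are illustrative assumptions, not confirmed from the repo; consult the training scripts shipped in src/open-r1-multimodal for the actual names.

```bash
# Minimal multi-GPU launch sketch. The entry point (grpo_rec.py) and the
# flag names are assumptions for illustration; check the scripts in
# src/open-r1-multimodal for the real arguments.
cd src/open-r1-multimodal
torchrun --nproc_per_node=8 \
    src/open_r1/grpo_rec.py \
    --model_name_or_path /path/to/base-vlm \
    --dataset_config data_config/rec.yaml \
    --image_root /path/to/coco/train2014 \
    --output_dir ./checkpoints/vlm-r1-rec
```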
Inference stage
- Go to the eval directory: cd src/eval
- Run the test script: python test_rec_r1.py --model_path ...
- Provide input: supply an image and a natural-language question such as "Where is the blue car in the picture?" (a sample invocation is sketched below)
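Putting these steps together, a run might look like the sketch below. Only --model_path appears in the steps above; the other flags and the output shown are assumptions, so check test_rec_r1.py for the arguments it actually accepts.

```bash
# Sample evaluation run. The --image_path and --question flags and the
# output format are illustrative assumptions.
cd src/eval
python test_rec_r1.py \
    --model_path ./checkpoints/vlm-r1-rec \
    --image_path ./images/street.jpg \
    --question "Where is the blue car in the picture?"
# Possible output: a bounding box in [x1, y1, x2, y2] pixel coordinates, e.g.
#   [132, 405, 318, 560]
```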
Input/output example
- Input: an image containing multiple objects plus a natural-language query (e.g. "find the red cup in the bottom right corner of the picture")
- Output: the bounding-box coordinates or a positional description of the target object
Note
For custom data, you can modify the data_config/rec.yaml configuration file to add your own image paths and annotation files (a sketch is shown below).
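For instance, a custom config might be written as in the sketch below. The field names (datasets, json_path, image_root, sampling_strategy) are assumptions about the yaml schema; mirror the structure of the shipped data_config/rec.yaml rather than copying this verbatim.

```bash
# Sketch of a custom data config written via a heredoc. The field names are
# assumptions; compare with the shipped data_config/rec.yaml before use.
cat > data_config/my_rec.yaml <<'EOF'
datasets:
  - json_path: /path/to/my_annotations.json   # RefCOCO-style annotation file
    image_root: /path/to/my_images            # directory with your images
    sampling_strategy: all
EOF
```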
This answer is based on the article "VLM-R1: A Visual Language Model for Localizing Image Targets through Natural Language".