Gaze-LLE is a gaze target prediction tool built on large-scale learned encoders, developed by Fiona Ryan, Ajay Bati, and other researchers. Its core goal is to efficiently predict where a person in an image or video is looking, using a pre-trained vision foundation model (e.g., DINOv2).
Its main functions include:
- Gaze target prediction: Accurately predicts the gaze location using a pre-trained visual encoder
- Multi-person gaze prediction: Multiple people in a single image can be processed simultaneously
- Lightweight architecture: Only a lightweight decoder is trained on top of a frozen pre-trained encoder
- Multi-model support: Provides pre-trained models built on different backbone networks (ViT-B/ViT-L) and training datasets
Compared with similar tools, the notable advantages of Gaze-LLE are that it has 1-2 orders of magnitude fewer learnable parameters and requires no additional input modalities (e.g., depth or pose information).
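To give a feel for why training only a decoder on a frozen encoder keeps the learnable parameter count small, here is a rough back-of-the-envelope sketch. It counts the weights in a standard transformer layer and compares a ViT-B-sized stack (12 layers, width 768) with a small decoder. The decoder size used here (3 layers, width 256) and the helper `transformer_layer_params` are assumptions made purely for illustration, not the actual Gaze-LLE configuration.

```python
def transformer_layer_params(dim: int, mlp_ratio: int = 4) -> int:
    """Count weights and biases in one standard transformer layer."""
    attention = 4 * (dim * dim + dim)  # Q, K, V and output projections
    hidden = dim * mlp_ratio
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)  # two linear layers
    layer_norms = 2 * (2 * dim)  # scale and shift for two layer norms
    return attention + mlp + layer_norms


# ViT-B transformer stack: 12 layers of width 768.
# Patch embedding and position embeddings are left out for simplicity.
frozen_encoder = 12 * transformer_layer_params(768)

# Hypothetical decoder size, chosen only for illustration.
trainable_decoder = 3 * transformer_layer_params(256)

total = frozen_encoder + trainable_decoder
print(f"frozen encoder:    {frozen_encoder:,}")
print(f"trainable decoder: {trainable_decoder:,}")
print(f"trainable share:   {trainable_decoder / total:.1%}")
```

Under these assumed sizes, the trainable part is a small fraction of the whole model, because the encoder weights are never updated during training.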
This answer comes from the article "Gaze-LLE: A Target Prediction Tool for Character Gaze in Video".