Architectural Principles of Gaze-LLE
Gaze-LLE is a computer vision tool developed by a team at Georgia Tech whose core architecture is built on pre-trained visual foundation models. It uses a frozen visual encoder such as DINOv2 as the backbone network and trains only a lightweight gaze decoder module. This design reduces the number of trainable parameters by 1-2 orders of magnitude compared with traditional methods, bringing the typical parameter count down from hundreds of millions to millions.
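To make the parameter argument concrete, here is a rough back-of-the-envelope count in Python. Every dimension in it is an illustrative assumption (a ViT-B-sized encoder with width 768 and 12 blocks, and a hypothetical decoder with width 256 and 3 blocks), not the published Gaze-LLE configuration, and the count ignores patch embeddings and output heads.

```python
def transformer_block_params(d: int, mlp_ratio: int = 4) -> int:
    """Approximate parameter count of one standard transformer block."""
    attn = 4 * d * d + 4 * d                     # q, k, v and output projections with biases
    hidden = mlp_ratio * d
    mlp = d * hidden + hidden + hidden * d + d   # two linear layers with biases
    norms = 2 * 2 * d                            # two LayerNorms (scale and shift)
    return attn + mlp + norms


def linear_params(d_in: int, d_out: int) -> int:
    """Parameter count of a linear layer with bias."""
    return d_in * d_out + d_out


# Assumed sizes, chosen only for illustration.
ENCODER_DIM, ENCODER_DEPTH = 768, 12   # frozen backbone, roughly ViT-B scale
DECODER_DIM, DECODER_DEPTH = 256, 3    # hypothetical lightweight decoder

frozen = ENCODER_DEPTH * transformer_block_params(ENCODER_DIM)
trainable = (
    linear_params(ENCODER_DIM, DECODER_DIM)                     # project features down
    + DECODER_DEPTH * transformer_block_params(DECODER_DIM)     # small decoder stack
)
ratio = frozen / trainable

print(f"frozen encoder parameters:    {frozen:,}")
print(f"trainable decoder parameters: {trainable:,}")
print(f"frozen / trainable ratio:     {ratio:.1f}")
```

Under these assumed sizes the frozen encoder is about 33 times larger than the trainable decoder, which falls within the 1-2 orders of magnitude described above. A larger backbone would widen the gap further.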
The core advance has two aspects. First, it relies entirely on RGB image input, discarding the depth information or human pose data required by traditional methods. Second, it achieves efficient prediction through feature reuse: a single image encoding can support gaze analysis for multiple people in a scene. This architecture makes Gaze-LLE considerably more efficient to compute and easier to deploy than existing solutions.
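The feature-reuse idea can be sketched in plain Python as an "encode once, decode per person" loop. This is a toy illustration, not Gaze-LLE's actual code: `FrozenEncoder`, `decode_gaze`, and `predict_all` are hypothetical names, the encoder is simple average pooling, and the decoder is a placeholder scoring rule standing in for a learned module.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # normalized head box (x0, y0, x1, y1)
Grid = List[List[float]]


class FrozenEncoder:
    """Stand-in for a frozen backbone; counts how many times it runs."""

    def __init__(self, grid: int = 4) -> None:
        self.grid = grid
        self.calls = 0

    def __call__(self, image: Grid) -> Grid:
        self.calls += 1
        h, w, g = len(image), len(image[0]), self.grid
        sums = [[0.0] * g for _ in range(g)]
        counts = [[0] * g for _ in range(g)]
        # Toy "features": average-pool the image into a g x g grid.
        for y in range(h):
            for x in range(w):
                gy, gx = y * g // h, x * g // w
                sums[gy][gx] += image[y][x]
                counts[gy][gx] += 1
        return [[sums[i][j] / counts[i][j] for j in range(g)] for i in range(g)]


def decode_gaze(features: Grid, head_box: Box) -> Grid:
    """Placeholder decoder: weight each cell by its distance from the head center."""
    g = len(features)
    x0, y0, x1, y1 = head_box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    heatmap = []
    for i in range(g):
        row = []
        for j in range(g):
            px, py = (j + 0.5) / g, (i + 0.5) / g
            dist = ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5
            row.append(features[i][j] * dist)
        heatmap.append(row)
    return heatmap


def predict_all(image: Grid, head_boxes: List[Box], encoder: FrozenEncoder) -> List[Grid]:
    features = encoder(image)  # the expensive step runs once per image
    return [decode_gaze(features, box) for box in head_boxes]  # cheap step runs per person


image = [[float((x + y) % 5) for x in range(8)] for y in range(8)]
encoder = FrozenEncoder()
heads = [(0.1, 0.1, 0.3, 0.3), (0.6, 0.2, 0.8, 0.4), (0.4, 0.7, 0.6, 0.9)]
heatmaps = predict_all(image, heads, encoder)
print(len(heatmaps), encoder.calls)  # prints "3 1"
```

Three heatmaps come out of a single encoder call, so the cost of adding another person to the scene is only one more pass through the small decoder.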
This answer comes from the article "Gaze-LLE: A Target Prediction Tool for Character Gaze in Video".































