Innovative design for simplified input
While traditional gaze-target prediction systems typically fuse multi-source sensor data (e.g. depth maps and pose estimates), Gaze-LLE performs end-to-end prediction from RGB images alone by leveraging the representational power of a pretrained visual encoder. Foundation models such as DINOv2 are shown to already encode scene depth and human-pose cues implicitly, which makes additional input modalities optional rather than essential.
This design brings three practical advantages: lower hardware requirements, since a consumer-grade RGB camera suffices; a simpler data-processing pipeline, with no need to align information from multiple sources; and greater robustness, since prediction cannot fail due to a missing modality. On benchmarks such as VideoAttentionTarget, this parsimonious design achieves better accuracy than multimodal approaches.
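The RGB-only pipeline described above can be sketched as a frozen pretrained encoder feeding a small learned head. The code below is a minimal illustration, not the actual Gaze-LLE implementation: the tiny convolutional "backbone" is a hypothetical stand-in for a frozen DINOv2 encoder, and the head and heatmap size are assumptions for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeSketch(nn.Module):
    """Minimal sketch of an RGB-only gaze-target pipeline in the spirit of
    Gaze-LLE: a frozen pretrained encoder plus a lightweight decoder head.
    The conv layer below is a stand-in for a real DINOv2 backbone."""

    def __init__(self, feat_dim=64, heatmap_size=32):
        super().__init__()
        # Stand-in for a frozen pretrained visual encoder (e.g. DINOv2):
        # patchify the image into 16x16 patches and embed them.
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        for p in self.backbone.parameters():
            p.requires_grad = False  # frozen: only the head is trained
        # Lightweight head mapping frozen features to a gaze heatmap.
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(feat_dim, 1, 1),
        )
        self.heatmap_size = heatmap_size

    def forward(self, rgb):
        # rgb: (B, 3, H, W) -- the only input modality; no depth, no pose.
        feats = self.backbone(rgb)          # (B, feat_dim, H/16, W/16)
        logits = self.head(feats)           # (B, 1, H/16, W/16)
        return F.interpolate(logits,
                             size=(self.heatmap_size, self.heatmap_size),
                             mode="bilinear", align_corners=False)

model = GazeSketch()
heatmap = model(torch.randn(1, 3, 224, 224))
print(heatmap.shape)  # torch.Size([1, 1, 32, 32])
```

Because the backbone's parameters have `requires_grad=False`, an optimizer built over `model.parameters()` updates only the small head, which is what keeps training cheap when the encoder is a large foundation model.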
This answer comes from the article *Gaze-LLE: A Target Prediction Tool for Character Gaze in Video*.