Gaze-LLE introduces several innovations in its technical architecture:
1. Model efficiency gains
- Trainable parameter count reduced by 1–2 orders of magnitude compared with prior approaches
- Only a lightweight gaze decoder is trained; the backbone encoder stays frozen
2. Simplified input modalities
- No dependence on depth sensors or head-pose estimation
- Requires only RGB images as input
3. Foundation-model-based design
- Builds on strong pretrained visual foundation models such as DINOv2
- Supports ViT-B, ViT-L, and other backbone networks
4. Training data flexibility
- Supports joint training on the GazeFollow and VideoAttentionTarget datasets
- Provides pretrained models for different dataset combinations
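The frozen-encoder / trainable-decoder pattern behind point 1 can be sketched in plain PyTorch. The module shapes below are illustrative stand-ins, not the actual Gaze-LLE architecture; the point is only how the backbone is frozen while just the small decoder receives gradient updates:

```python
import torch
import torch.nn as nn

# Stand-in for a large pretrained backbone (e.g. a DINOv2 ViT).
# Layer sizes are hypothetical, chosen only for illustration.
backbone = nn.Sequential(
    nn.Linear(768, 768),
    nn.GELU(),
    nn.Linear(768, 768),
)

# Stand-in for the lightweight trainable gaze decoder.
decoder = nn.Sequential(
    nn.Linear(768, 64),
    nn.GELU(),
    nn.Linear(64, 1),
)

# Freeze the backbone: its parameters get no gradients and never update.
for p in backbone.parameters():
    p.requires_grad = False

# Only the decoder's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in decoder.parameters() if p.requires_grad), lr=1e-4
)

trainable = sum(p.numel() for p in decoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable}, frozen: {frozen}")
```

In the released Gaze-LLE models the frozen backbone is a DINOv2 ViT-B or ViT-L and the decoder is a small transformer head; this toy sketch only demonstrates the freezing mechanism that keeps the trainable parameter count so small.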
These advantages give Gaze-LLE significant benefits in computational cost, ease of deployment, and prediction accuracy, making it particularly well suited to applications such as real-time video analytics.
This answer is based on the article "Gaze-LLE: A Target Prediction Tool for Character Gaze in Video".