Humans make approximately three eye movements per second to direct central vision (the fovea) to regions of interest in a scene. Clinicians use eye movements to assess conditions such as concussion, dyslexia, and fatigue. However, most approaches focus on the dynamics of eye movements (e.g., fixation stability, peak velocity, anti-saccades) without analyzing in detail how fixations relate to the semantic content of the fixated regions. Although where humans direct their gaze offers a window into the mind, there is no benchmark theoretical framework that predicts how a person with normal vision and cognition should fixate a real-world scene to comprehend it accurately or to answer a particular question as quickly as possible. The absence of such a benchmark limits the value of eye movements as a diagnostic tool.
Researchers at the University of California, Santa Barbara have developed an AI-powered method that models optimal human eye-movement patterns for scene comprehension and uses them to assess visual and cognitive health. The method combines a foveated visual-language model with reinforcement learning to simulate and predict optimal eye movements for viewing complex scenes and answering questions about them. By comparing these optimal fixation sequences with an individual's actual eye-tracking data, the method produces a quantitative score reflecting the efficiency and accuracy of that individual's visual exploration.
Additionally, the approach measures scene-comprehension accuracy after N eye movements in two cases: the optimal foveated model, and the same model driven by the person's measured fixations (a proxy for human comprehension). The difference between the two, the optimal-human gap, is reported as a score that quantifies the comprehension cost of any inefficiency in how the person samples the scene with eye movements. The method thus supports advanced diagnostics for visual and cognitive deficits by analyzing how closely a person's eye movements align with theoretically optimal patterns that maximize scene understanding and, critically, what any misalignment costs in scene comprehension or question-answering accuracy. It also extends eye-tracking assessment beyond simple laboratory tasks to ecologically valid, real-world scenarios by leveraging large multimodal language models and novel AI-trained fixation-selection models, with applications in clinical practice and vision-science research.
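To make the optimal-human gap concrete, here is a minimal Python sketch. It assumes per-trial correctness records already exist for the model run on its own optimal fixations and for the same model driven by a person's recorded fixations. The function names accuracy_curve and optimal_human_gap, and the choice to summarize the gap as its mean over fixation counts, are illustrative assumptions, not the UCSB method.

```python
from __future__ import annotations

from typing import Sequence


def accuracy_curve(correct_by_trial: Sequence[Sequence[bool]]) -> list[float]:
    """Mean accuracy after 1..N fixations (hypothetical helper).

    correct_by_trial[t][n] is True if the model answered trial t correctly
    after n + 1 fixations. All trials must cover the same N.
    """
    if not correct_by_trial:
        raise ValueError("need at least one trial")
    n_fix = len(correct_by_trial[0])
    if any(len(trial) != n_fix for trial in correct_by_trial):
        raise ValueError("all trials must have the same number of fixations")
    n_trials = len(correct_by_trial)
    # Booleans sum as 0/1, so this is the fraction correct at each fixation count.
    return [sum(trial[n] for trial in correct_by_trial) / n_trials for n in range(n_fix)]


def optimal_human_gap(
    acc_optimal: Sequence[float], acc_human: Sequence[float]
) -> tuple[list[float], float]:
    """Per-N gap (optimal minus human-driven) and its mean over N.

    Averaging over N is an assumed summary; a gap at a fixed N would also work.
    """
    if len(acc_optimal) != len(acc_human) or not acc_optimal:
        raise ValueError("curves must be non-empty and of equal length")
    gaps = [o - h for o, h in zip(acc_optimal, acc_human)]
    return gaps, sum(gaps) / len(gaps)


if __name__ == "__main__":
    # Toy, illustrative numbers only (not measured data).
    gaps, score = optimal_human_gap([0.5, 0.75, 1.0], [0.25, 0.5, 1.0])
    print(gaps, round(score, 3))
```

With these toy accuracies the per-N gaps are 0.25, 0.25, and 0.0, and the mean gap is about 0.167: the human-driven model catches up by the third fixation, but its earlier shortfall still registers as a comprehension cost.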
eyes, eye movement, diagnostics, diagnostic medicine, vision, AI, cognitive health, eye tracking, concussion