An Eye-Controlled Web Browser with LLM Support

This thesis presents a system that integrates eye tracking with Large Language Models (LLMs) to enhance web browsing. The application lets users select webpage elements by gaze and processes their attributes (such as content, URLs, and styles) with an LLM. Adding gaze-based interaction to standard web browsing, alongside mouse and keyboard, provides a smoother and more flexible user experience. When the user looks at one or more page elements, their textual or graphical content can be sent to the LLM to obtain summaries, descriptions, paraphrases, or translations. The system, tested with a Gazepoint GP3 HD eye tracker, consists of a JavaScript browser extension for real-time extraction of webpage elements, a Python module running on the client machine, and a WebSocket server through which the components communicate. In a user study structured as several tasks on different websites, participants performed well and gave positive feedback, indicating that the system is user-friendly and effective for its intended purposes. Integrating eye tracking and LLMs holds considerable potential for creating more natural and efficient web interaction tools, and this project is a further step towards realizing it. Future research may extend the system to dynamic content such as animations and videos, and investigate additional input modalities, including voice and gestures.

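To make the pipeline described in the abstract more concrete, the following is a minimal Python sketch of how a gaze-selected page element might be turned into a request for the LLM. It is an illustration under stated assumptions, not the thesis implementation: the `Element` record, the dwell-time selection rule with its 500 ms threshold, and the JSON message schema are all hypothetical, and the actual WebSocket transport and eye-tracker interface are left out.

```python
import json
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical record for a page element reported by the browser extension."""
    element_id: str
    left: float
    top: float
    width: float
    height: float
    text: str

    def contains(self, x, y):
        # True if the gaze point (x, y) falls inside the element's bounding box.
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def select_by_dwell(samples, elements, dwell_ms=500.0):
    """Return the first element looked at continuously for at least dwell_ms.

    samples: iterable of (timestamp_ms, x, y) gaze points in page coordinates.
    Dwell-time selection is an assumption made for this sketch; the abstract
    does not state which selection rule the system uses.
    """
    current = None  # element currently under the gaze, if any
    start = None    # timestamp at which the gaze entered `current`
    for t, x, y in samples:
        hit = next((e for e in elements if e.contains(x, y)), None)
        if hit is not current:
            # Gaze moved to a different element (or off all elements): restart timer.
            current, start = hit, t
        if current is not None and t - start >= dwell_ms:
            return current
    return None


def build_request(element, task="summarize"):
    """Serialize a selection as a JSON message (hypothetical schema)."""
    return json.dumps({
        "type": "llm_request",
        "task": task,
        "element_id": element.element_id,
        "text": element.text,
    })
```

As a usage example, feeding the function a short gaze trace that rests on a paragraph for more than half a second selects that paragraph, and `build_request(chosen, "translate")` yields a JSON string that a component such as the Python module could forward to the LLM.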