Il Ruolo dell'Entropia di Tsallis nel Fenomeno Best of Both Worlds nei Multi-Armed Bandits

Sequential decision-making is a fundamental paradigm in modern statistics and computer science, and it has been receiving increased attention due to the recent developments in reinforcement learning, leading to algorithms such as AlphaGo, developed by Google DeepMind. Multi-Armed Bandits is a fundamental model in this framework, where a player sequentially picks actions and receives rewards with the goal of minimizing a notion of regret, finding the best balance between exploring actions and exploiting them. Over the past five years, one of the main breakthroughs in the theory of Multi-Armed Bandits was the discovery of the best of both worlds phenomenon: with a proper choice of entropic regularizer, the same algorithm (Follow-The-Regularized-Leader, FTRL) has been shown to be robust to both statistical and adversarial noise, adaptively and without prior knowledge on the nature of the game. Our goal in this thesis is to dissect recent literature on Multi-Armed Bandits, presenting an organic and stand-alone presentation of the best-of-both-words phenomenon that avoids the technical and involved exposition typically found in the literature. Particularly, we review the main ideas around this phenomenon and emphasize the key role the Tsallis Entropy plays in FTRL algorithms to adaptively achieve optimal regret bounds in both statistical and adversarial environments, without the need for specialized tuning of parameters. By connecting ideas from various papers, we present a cohesive framework and a unified view of adversarial and stochastic bandits, paving the way for simplified use of these results and ideas in broader settings.

Il sequential decision-making è un paradigma fondamentale nella statistica moderna e in computer science, che sta ricevendo sempre più attenzione anche grazie ai recenti sviluppi del reinforcement learning, che hanno portato ad algoritmi come AlphaGo, sviluppato da Google DeepMind. Multi-Armed Bandits (MAB) è un modello fondamentale in questo contesto, in cui un giocatore sceglie in maniera sequenziale azioni e riceve ricompense, con l’obiettivo di minimizzare una nozione di regret, cercando l’equilibrio ottimale tra l’esplorazione delle azioni e la loro capitalizzazione. Negli ultimi cinque anni, uno dei progressi rivoluzionari nella teoria dei Multi-Armed Bandits è stata la scoperta del fenomeno best-of-both-worlds: con una scelta appropriata del regolarizzatore entropico, un unico algoritmo (Follow-The- Regularized-Leader, FTRL) ha dimostrato di essere robusto sia al rumore statistico che a quello avversario, in modo adattivo e senza conoscenze pregresse sulla natura del gioco. Il nostro obiettivo in questa tesi è quello di analizzare la recente letteratura sui Multi- Armed Bandits, con una presentazione organica e a sé stante del fenomeno del best-of- both-words, evitando l’esposizione tecnica tipica della recente letteratura. In particolare, analizziamo le idee principali di questo fenomeno ed evidenziamo il ruolo chiave che l’entropia di Tsallis gioca nell’algoritmo FTRL per raggiungere, in modo adattivo, gli upper-bound ottimali dello pseudo-regret, in ambienti sia stocastici che avversari, senza la necessità di un tuning specifico dei parametri. Collegando le idee di vari articoli, presentiamo un quadro coeso e una visione unificata dei MAB avversari e stocastici, aprendo la strada a future applicazioni di questi risultati e idee in contesti più ampi.