Analisi della performance di algoritmi di filtering per reti complesse

In the field of complex systems, the study of networks has become of increasingly central interest. Through the mathematical framework of graph theory, we are able to represent and describe with the same formalism the dynamics of very different systems, from metabolic processes to economic transactions, social relationships and many others. This approach, however, represents a computational challenge, since many networks of practical interest have an enormous number of units. Filtering algorithms have been developed in order to extract the most important features of a network. Most of these algorithms rely on statistical tests performed on each link of the network, checking whether it is statistically significant against a null hypothesis of (semi)random emergence of connections in the network. The output of a filtering algorithm is a backbone of the original network made of links considered statistically significant. Different algorithms output different backbones with different features characterized by a different statistical meaning. To date there is no method to quantify the goodness of an algorithm and to compare backbones in an objective way. The aim of this thesis is to fill this gap in the literature of network filtering algorithms. To do so, I proceeded in two steps: firstly, I compared the performance of four different filters (the Hypergeometric, the Disparity, the Pólya and the Max Entropy filters) in different scenarios. I tested them on four different networks, two empirical and two synthetic ones, and I implemented various metrics to compare the properties of the backbones extracted by the different filters. Secondly, I designed specific ground truth tests and metrics to quantify their ability to detect anomalies, hence to test a filter’s ability to recognize truly statistical significant links. These test were based on different types of manipulation of empirical networks. In conclusion, the aim of this thesis is to support researchers dealing with large network datasets to decide which filtering algorithm is the most suitable for the task they need to accomplish.

Nel settore dei sistemi complessi, lo studio delle reti ha assunto un ruolo sempre più centrale. Grazie alla teoria dei grafi, siamo infatti in grado di rappresentare e descrivere con un unico formalismo sistemi molto diversi tra loro, dai processi metabolici alle transazioni economiche, alle relazioni sociali e molti altri. Tuttavia, questo approccio rimane una sfida dal punto di vista computazionale, poiché diverse reti di interesse pratico hanno un numero enorme di unità. Sono stati sviluppati algoritmi di filtering per estrarre le caratteristiche più importanti di una rete e la maggior parte di questi algoritmi si basa su test statistici eseguiti sui link della rete, verificando per ogni link se questo è statisticamente significativo rispetto a un'ipotesi nulla di emergenza (semi)random nella rete. L'output di un algoritmo di filtering è una backbone (una spina dorsale) della rete originale formata dai link considerati statisticamente significativi. È chiaro che algoritmi diversi producono backbone diverse con caratteristiche diverse a cui corrisponde un diverso significato statistico. Ad oggi non esistono metodi per quantificare la bontà di un algoritmo e per confrontare le backbone in modo oggettivo. Lo scopo di questa tesi è quello di colmare questa lacuna nella letteratura sugli algoritmi di filtering delle reti. Per farlo ho proceduto in due step: in primo luogo, ho confrontato le prestazioni di quattro diversi filtri (l'Ipergeometrico, il Disparity, il Pólya e il Max Entropy) in diversi scenari. Li ho testati su quattro reti diverse, due empiriche e due sintetiche, e ho implementato varie metriche per confrontare le backbone estratte. In secondo luogo, ho progettato test specifici di ground-truth e metriche per quantificare la loro capacità di rilevare le anomalie, cioè di riconoscere i link veramente significativi dal punto di vista statistico. Questi test erano basati su diversi tipi di manipolazione delle reti empiriche. In conclusione, questa tesi ha come scopo aiutare ricercatori che hanno interesse a filtrare reti di grandi dimensioni a scegliere quale algoritmo di filtering sia il più adatto alle loro necessità