A graph-based approach for web robot detection.

It is a common belief that the majority of Internet traffic is generated by humans. This is not completely true. In fact, about half of the traffic is generated by web robots, that is, software agents used to crawl websites and fetch their content in a completely automatic way. Some web robots are used by search engines to index web resources, while others are used for malicious purposes, such as exploiting website vulnerabilities. The identification of web robot traffic is therefore of paramount importance. The goal of this thesis work is to discover the structural properties of websites by analyzing the log files stored by web servers, and to identify web robots. In particular, the identification relies on an innovative graph-based approach built on a Request Dependency Graph, whose nodes represent web resources and whose edges represent the relationships between them. More specifically, the Request Dependency Graph distinguishes primary resources, which are generally web pages, from embedded resources, that is, all the objects referenced inside pages. It is worth emphasizing that the proposed method focuses on the browsing behavior of web robots and does not depend on the website structure, so it can be applied to different websites. From the structural analysis of websites, various behavioral patterns have been determined, highlighting the differences between human users and web robots.

Identification of web robots through a graph-based approach. It is widely believed that almost all Internet traffic is generated by humans. This is not actually the case. In reality, about half of the traffic is generated by web robots, which are programs used to navigate websites and acquire their data in a fully automatic way. Some robots are used by search engines to index resources, while others are used for malicious purposes, such as exploiting website vulnerabilities. For these reasons, identifying web robots is of primary importance. The goal of this thesis work is to investigate the structural properties of websites, starting from the analysis of the logs recorded by web servers, and to identify web robots. In particular, the identification is performed by applying an innovative graph-based method founded on a Request Dependency Graph, whose nodes represent the resources of the site and whose edges are their relationships. In more detail, the Request Dependency Graph determines which resources are primary (generally pages) and which are instead secondary (all the objects referenced inside pages). From the structural analysis of the sites, several behavioral patterns have been determined that highlight the differences between human users and robots. It is important to stress that the proposed method relies on the characteristics of web robot navigation and does not depend on the structure of the site, so it can be applied to different websites.
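
To make the idea of a graph whose nodes are web resources and whose edges are their relationships more concrete, the following is a minimal sketch. It is not the method developed in the thesis. It assumes that each log record carries a referrer field and that an edge can be drawn from the referring resource to the requested one. The log records, the `build_graph` and `embedded_ratio` helpers, and the extension-based rule for recognizing embedded resources are all hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict

# Hypothetical, simplified log records: (client, requested_url, referrer).
# Real server logs carry more fields; these values are made up.
LOG = [
    ("c1", "/index.html", None),
    ("c1", "/style.css", "/index.html"),
    ("c1", "/logo.png", "/index.html"),
    ("c1", "/about.html", "/index.html"),
    ("c1", "/style.css", "/about.html"),
    ("bot", "/index.html", None),
    ("bot", "/about.html", None),
]

# Assumption: a resource counts as "embedded" if it has one of these
# extensions. This is a stand-in heuristic, not the thesis's criterion.
EMBEDDED_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def build_graph(records):
    """Return adjacency sets mapping each resource to the resources
    that were requested with it as referrer."""
    graph = defaultdict(set)
    for _client, url, referrer in records:
        graph[url]  # touching the key registers the resource as a node
        if referrer is not None:
            graph[referrer].add(url)
    return dict(graph)


def is_embedded(url):
    """Classify a resource with the extension heuristic above."""
    return url.lower().endswith(EMBEDDED_EXTENSIONS)


def embedded_ratio(records, client):
    """Fraction of a client's requests that target embedded resources."""
    urls = [url for c, url, _ref in records if c == client]
    if not urls:
        return 0.0
    return sum(is_embedded(u) for u in urls) / len(urls)


graph = build_graph(LOG)
for client in ("c1", "bot"):
    print(client, embedded_ratio(LOG, client))
```

In this toy log, client `c1` requests pages together with the objects they reference, while `bot` requests only pages. A per-client ratio of this kind is one example of a behavioral signal that could separate the two; the patterns actually identified in the thesis may differ.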