Detection of Phishing Websites using Machine Learning
ZIENI, RASHA
2020/2021
Abstract
The rapid growth of online services has been accompanied by an increase in cybercriminal attacks. Phishing is among the most harmful of these attacks: it is a manipulative security threat that aims to steal sensitive information from users, causing damage such as financial loss, identity theft, or reputational harm. Phishing mainly exploits human emotions by accurately mimicking legitimate sources, so users are often confused and unable to recognize an attack. This thesis proposes an approach for automatically detecting phishing webpages. The approach applies Machine Learning techniques to classify webpages as phishing or legitimate. Each webpage is described by a set of features extracted from the URL used to reach the page and from its HTML source code. These features are selected on the basis of the techniques that attackers adopt when they design phishing webpages. The approach is evaluated experimentally on a dataset of 5,000 legitimate and 5,000 phishing webpages. The results obtained with three Machine Learning classifiers, namely Random Forest, Support Vector Machine, and Logistic Regression, indicate that the approach is effective in detecting phishing webpages. All three classifiers perform well overall; Random Forest outperforms the others, reaching an accuracy of 93%.
Users may download and share the full-text documents available in UNITESI UNIPV in compliance with the Creative Commons CC BY NC ND license.
For more information, and to check whether the file is available, write to: unitesi@unipv.it.
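As an illustration of the feature-extraction step described in the abstract, the following minimal sketch derives a few features from a URL and from the HTML source of a page. The specific features (`url_length`, `num_dots_in_host`, `has_at_symbol`, `host_is_ip`, `uses_https`, `password_inputs`, `external_form_actions`) and all function names are assumptions chosen for illustration; the abstract does not list the feature set actually used in the thesis.

```python
import ipaddress
from html.parser import HTMLParser
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Compute a few illustrative URL features (hypothetical, not the thesis's set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError if host is not an IP literal
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip,
        "uses_https": parsed.scheme == "https",
    }


class _FormCounter(HTMLParser):
    """Count password inputs and forms that submit to a different host."""

    def __init__(self, page_host: str):
        super().__init__()
        self.page_host = page_host
        self.password_inputs = 0
        self.external_form_actions = 0

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "input" and (attr.get("type") or "").lower() == "password":
            self.password_inputs += 1
        elif tag == "form":
            # Relative actions have no hostname and are treated as internal.
            action_host = urlparse(attr.get("action") or "").hostname
            if action_host and action_host != self.page_host:
                self.external_form_actions += 1


def html_features(html: str, page_url: str) -> dict:
    """Compute a few illustrative HTML features (hypothetical, not the thesis's set)."""
    counter = _FormCounter(urlparse(page_url).hostname or "")
    counter.feed(html)
    counter.close()
    return {
        "password_inputs": counter.password_inputs,
        "external_form_actions": counter.external_form_actions,
    }


def extract_features(url: str, html: str) -> dict:
    """Merge URL and HTML features into one description of a webpage."""
    return {**url_features(url), **html_features(html, url)}
```

Each resulting dictionary could then be converted into a numeric vector and passed to a classifier such as Random Forest, Support Vector Machine, or Logistic Regression; that step is omitted here because the abstract does not specify the implementation.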
https://hdl.handle.net/20.500.14239/14064