This study investigates the application of SAFE AI and machine learning techniques for the early detection and evaluation of cyber attack risks. The main objective is to develop a predictive framework that not only achieves strong classification performance but also ensures stability, explainability, fairness, and robustness under realistic uncertainty conditions. The analysis is based on the Cyber Event Database from the Center for International & Security Studies at Maryland, comprising 13,841 cleaned observations across 167 countries mapped into six continents. Using key variables such as year, actor, actor type, industry, event subtype, motive, and continent, cyber attack patterns are aggregated into a binary severity classification (High/Low) according to their frequency, which serves as the target variable. Several supervised learning models are implemented, including Logistic Regression, Random Forest, Support Vector Machine, and XGBoost Classifier. To address class imbalance, both class weight adjustment and SMOTE are applied, with SMOTE-selected models used for further SAFE AI evaluation. Model performance is assessed through standard classification metrics such as Accuracy, TPR, FPR, and ROC AUC, and complemented by SAFE AI’s rank-based analysis to examine robustness and fairness beyond predictive accuracy. In addition, a Markov Switching time-series framework is integrated to identify shifts between different cyber risk regimes and to evaluate the practical adaptability of the models in dynamic environments. The results indicate that XGBoost achieves the strongest overall performance, combining high predictive accuracy with superior explainability, fairness, and robustness. Tree-based and gradient boosting methods demonstrate particular suitability for non-linear cybersecurity data with ordinal severity structures.
The study’s main contribution lies in proposing an integrated framework that combines SAFE AI evaluation with machine learning and regime-switching analysis, providing a more comprehensive and practically adaptable approach to early cyber risk detection while supporting cost-efficient cybersecurity decision-making.
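As a rough illustration of the standard metrics named in the abstract (Accuracy, TPR, FPR, and ROC AUC), the following minimal sketch computes them for a binary High/Low target. It is not the thesis's implementation: the function names, the encoding High = 1 and Low = 0, the 0.5 decision threshold, and the toy labels and scores are all assumptions made up for this example.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, TPR and FPR for binary labels (assumed encoding: 1 = High, 0 = Low)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        # TPR: share of truly High cases that were flagged as High.
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        # FPR: share of truly Low cases that were wrongly flagged as High.
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (High, Low) pairs ranked correctly; ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not taken from the Cyber Event Database).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The pairwise form of ROC AUC is used here because it shows the ranking interpretation of the metric directly; for a dataset of this size a library routine would be the more practical choice.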
Application of SAFE AI and Machine Learning in the Early Detection and Evaluation of Cyber Attack Risks
DO, ANH QUAN
2024/2025
| File | Description | Size | Format | Access |
|---|---|---|---|---|
| Anh_Quan_Do Thesis SAFE AI With Data Breach Risks.pdf | This study applies SAFE AI and machine learning methods to the early detection and evaluation of cyber attack risks, aiming to develop a suitable framework for cyber-risk prediction on large datasets. | 2.8 MB | Adobe PDF | Open access |
Users are permitted to download and share the full-text documents available in UNITESI UNIPV in accordance with the Creative Commons CC BY NC ND license.
For more information, and to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/34851