ML vs LLMs in NLP with application to spam detection

NOSRATI, SHARAREH

2023/2024

Abstract

Text-based communication such as email, SMS, and instant messaging plays a central role in personal and professional interactions. The persistent rise in spam messages, however, wastes resources, reduces productivity, creates security risks, and increases the chance that important messages are misclassified. Accurate and efficient spam detection systems are essential to address these issues.

This research compares traditional machine learning models (Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), and XGBoost) with a fine-tuned GPT-3.5 Turbo model for classifying messages as spam or ham. Several preprocessing techniques (lowercasing, stopword removal, stemming, and lemmatization) were applied to evaluate their effect on model performance, and TF-IDF and Doc2Vec were used as feature extraction methods to convert text into numerical representations.

Among the traditional models, SVM with TF-IDF and lowercase preprocessing delivered the best overall performance, achieving 98.26% accuracy, 99.39% precision, and a 93.10% F1-score. Random Forest and XGBoost also produced strong results, particularly in balancing precision and recall. Precision-recall and ROC curves were analyzed to identify decision thresholds that minimize false positives while maximizing spam detection.

The GPT-3.5 Turbo model was fine-tuned on a small dataset of 30 SMS messages, 30% of which were spam, yet it achieved a high recall (98.26%), meaning it was very effective at detecting spam messages. This came at the cost of lower precision (75.28%), that is, more false positives (legitimate messages marked as spam).

GPT-3.5 Turbo also has practical limitations. It requires an internet connection to access OpenAI's servers, it incurs a cost each time it is used, and it takes longer to process messages than SVM, which runs locally at no cost.

SVM, by contrast, performed better on accuracy, precision, and F1-score, making it the more reliable and cost-effective choice for spam detection. Its recall was lower (87.55%), meaning it missed some spam messages. If minimizing false positives is the priority, SVM is the better option; if catching as much spam as possible matters more, GPT-3.5 Turbo is the stronger choice.

This research emphasizes the importance of combining effective preprocessing with robust feature extraction and machine learning algorithms to improve spam detection. The findings provide a practical framework for developing reliable spam filters that improve classification accuracy while limiting the misclassification of legitimate messages, with potential applications in email filtering, cybersecurity, and fraud prevention across digital communication platforms.
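
The precision, recall, and F1 figures quoted in the abstract can be cross-checked against the standard definition of F1 as the harmonic mean of precision and recall. The sketch below is not taken from the thesis; the function name `f1_from_precision_recall` is illustrative, and the GPT-3.5 Turbo F1 value it prints is derived here from the quoted precision and recall rather than reported in the abstract.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall.

    Inputs and output share the same unit (here: percent).
    """
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract, in percent.
svm_f1 = f1_from_precision_recall(99.39, 87.55)
gpt_f1 = f1_from_precision_recall(75.28, 98.26)  # derived, not reported

print(f"SVM (TF-IDF, lowercase) F1: {svm_f1:.2f}%")
print(f"Fine-tuned GPT-3.5 Turbo F1: {gpt_f1:.2f}%")
```

Under this definition the SVM value comes out at about 93.10%, which agrees with the reported F1-score, while the GPT-3.5 Turbo value comes out at about 85.25%, consistent with the abstract's statement that its high recall was offset by lower precision.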

Faculty/Department: DIPARTIMENTO DI INGEGNERIA INDUSTRIALE E DELL'INFORMAZIONE
Degree programme: INDUSTRIAL AUTOMATION ENGINEERING - INGEGNERIA DELL'AUTOMAZIONE INDUSTRIALE [06417]
Academic year: 2023
English title: ML vs LLMs in NLP with application to spam detection
Supervisor: BABAEI, GOLNOOSH; GIUDICI, PAOLO STEFANO
Appears in collections: Lauree Magistrali (master's degree theses)

Files in this item:

File	Size	Format
Sharareh_Nosrati_Thesis_2025.pdf (not available; copy on request)	10.35 MB	Adobe PDF

Users may download and share the full-text documents available in UNITESI UNIPV in accordance with the Creative Commons CC BY NC ND license.
For more information, or to check whether the file is available, write to: [email protected].

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14239/33480