ML vs LLMs in NLP with application to spam detection

NOSRATI, SHARAREH
2023/2024

Abstract

Text-based communication such as email, SMS, and instant messaging plays a central role in personal and professional interactions. The persistent rise in spam, however, wastes resources, reduces productivity, creates security risks, and increases the chance that important messages are misclassified. Accurate and efficient spam detection systems are therefore essential. This research applies machine learning techniques, namely Logistic Regression, Decision Tree, Random Forest, Support Vector Machine (SVM), XGBoost, and a fine-tuned GPT-3.5 Turbo, to classify messages as spam or ham. Several preprocessing techniques (lowercasing, stopword removal, stemming, and lemmatization) were applied to evaluate their effect on model performance, and TF-IDF and Doc2Vec were used as feature extraction methods to convert text into numerical representations. Among the traditional machine learning models, SVM with TF-IDF and lowercase preprocessing delivered the best overall performance, achieving 98.26% accuracy, 99.39% precision, and a 93.10% F1-score, which made it the most reliable spam detection model. Random Forest and XGBoost also produced strong results, particularly in balancing precision and recall. A detailed analysis of precision-recall and ROC curves gave insight into the decision thresholds that best minimize false positives while maximizing spam detection.

The GPT-3.5 Turbo model was fine-tuned on a small dataset of 30 SMS messages, 30% of them spam, yet it achieved high recall (98.26%), meaning it detected nearly all spam messages. This came at the cost of lower precision (75.28%), so more legitimate messages were marked as spam (false positives). GPT-3.5 Turbo also has practical limitations: it requires an internet connection to reach OpenAI’s servers, it incurs a cost each time it is used, and it takes longer to process messages than SVM, which runs locally, for free, and much faster. SVM performed better overall in accuracy, F1-score, and precision, making it the more reliable and cost-effective choice, but its recall was lower (87.55%), meaning it missed some spam messages. If minimizing false positives is the priority, SVM is the better option; if catching as much spam as possible matters more, GPT-3.5 Turbo is the stronger choice.

This research emphasizes the importance of combining effective preprocessing with robust feature extraction and machine learning algorithms to improve spam detection performance. The findings offer a practical framework for building reliable spam filters that improve classification accuracy while limiting the risk of misclassifying legitimate messages. These insights can inform future work on AI-driven spam detection, with potential applications in email filtering, cybersecurity, and fraud prevention across digital communication platforms.
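The precision, recall, and F1 figures quoted above are linked by the standard definition of F1 as the harmonic mean of precision and recall. The short Python sketch below is an illustration only, not code from the thesis. The helper name `f1_score` is hypothetical, and the only inputs are the percentages stated in the abstract. It checks that the reported SVM F1-score is consistent with the reported precision and recall, and derives an approximate F1 for GPT-3.5 Turbo, a value the abstract does not report.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, converted from percentages to fractions.
svm_f1 = f1_score(precision=0.9939, recall=0.8755)

# Derived here for comparison only; the abstract does not report this value.
gpt_f1 = f1_score(precision=0.7528, recall=0.9826)

print(f"SVM F1 (recomputed): {svm_f1:.4f}")
print(f"GPT-3.5 Turbo F1 (derived): {gpt_f1:.4f}")
```

The recomputed SVM value agrees with the reported 93.10% to rounding. The derived GPT-3.5 Turbo value comes out lower, at roughly 85%, which matches the abstract's statement that SVM had the better F1-score.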
Files in this item:
File: Sharareh_Nosrati_Thesis_2025.pdf
Availability: not available
Size: 10.35 MB
Format: Adobe PDF

Users may download and share the full-text documents available in UNITESI UNIPV in compliance with the Creative Commons CC BY NC ND license.
For more information, or to check whether the file is available, write to: unitesi@unipv.it.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14239/33480