Detection and Classification of Artificially Generated Images using Deep Residual Networks and Multimodal Language Models

This thesis investigates the application of deep convolutional networks and large language models to the detection of artificially generated images. As such imagery becomes steadily more common online, reliable detection systems are needed to help mitigate the malicious exploitation of synthetic content. This research explores two technologies applicable to fake image detection, both trained and tested on large-scale datasets created by extending CIFAR-10 and ImageNet with a matching number of artificially generated images produced by multiple state-of-the-art generative models. The first set of experiments tested ResNet, a convolutional neural network (CNN), modified to predict simultaneously which generator produced an image and the object class of its subject. To understand the network's inner workings, explainable AI was applied through Grad-CAM. The second approach leveraged TinyLLaVA, a multimodal language model (MLM) that can detect artificial images while also explaining in natural language the reasons for its answer. The results indicate that the CNN achieves performance comparable to the state of the art (SOTA) when distinguishing real from fake images, while keeping its object identification capabilities almost intact. The MLM was tested on a custom two-task benchmark; although it detected fake images with a good degree of accuracy, it was sometimes unable to provide a valid explanation for its answer. MLMs performed worse than CNNs, but they are better able to convey how they reached their results, which creates a trade-off between performance and interpretability.
These findings suggest a promising path towards a detection system built on a specialized MLM, optimized to detect artificial images with the low error rate characteristic of CNNs while maintaining a high degree of interpretability.
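
As an illustration of the dual-prediction idea described above, the following minimal sketch shows how two classification heads can share one feature vector and be trained with a combined loss. It is not the thesis's implementation: the abstract does not state how the two predictions are combined, so the summed cross-entropy, the `source_weight` parameter, the three-source head size, and all numeric values are assumptions made for illustration. Only the ten object classes follow from CIFAR-10.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct label."""
    return -math.log(softmax(logits)[target])


def linear_head(features, weights, biases):
    """One classification head: a linear layer over the shared features."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def joint_loss(source_logits, object_logits, source_label, object_label,
               source_weight=1.0):
    """Weighted sum of the two per-task losses (an assumed combination)."""
    return (source_weight * cross_entropy(source_logits, source_label)
            + cross_entropy(object_logits, object_label))


# Toy stand-in for the feature vector a CNN backbone would produce.
features = [0.5, -1.0, 2.0]

# Hypothetical head sizes: 3 sources (real + 2 generators), 10 object classes.
source_weights = [[0.1 * (i + j) for j in range(3)] for i in range(3)]
object_weights = [[0.05 * (i - j) for j in range(3)] for i in range(10)]

source_logits = linear_head(features, source_weights, [0.0] * 3)
object_logits = linear_head(features, object_weights, [0.0] * 10)

loss = joint_loss(source_logits, object_logits, source_label=1, object_label=4)
print(f"joint loss: {loss:.4f}")
```

When every logit in a head is equal, that head's loss term reduces to the logarithm of its number of classes, which gives a quick sanity check for an untrained model.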
