Multi-omic data integration with generative AI: an approach to predicting the methylation profile from DNA

Advances in Next Generation Sequencing (NGS) technologies over recent decades have led to a growing availability of “omics” data: biological data that include, for example, genomics (DNA-seq), transcriptomics (RNA-seq) and epigenomics (modifications that affect gene activity without changing the DNA sequence, such as methylation). In recent years, interest has grown in integrating several omics, because each one alone is not always sufficient to describe a biological process or diagnose a disease: the omic layers interact and influence each other in complex ways. The complexity of integrating and interpreting multi-omic data, together with their large volume, has raised interest in recent innovations in generative AI such as Large Language Models (LLMs). These models have proven effective in the analysis of natural language sequences and can therefore also be applied to biological sequences, which are processed analogously to text through a tokenization strategy. This thesis work, carried out in collaboration with enGenome srl (a spin-off of the University of Pavia), aims to explore, analyze, test and compare different generative AI models for genomic sequences. These models, described in the literature, have been applied to the analysis of data from one or more omic sources. First, models working on a single type of omic data at a time were considered. Among these, models specialized on DNA sequences can predict genomic features (such as the presence of a promoter sequence, or the effect of a damaging DNA mutation) starting from the DNA sequence itself. Other models are trained to predict the methylation profile of a genomic region from its DNA sequence. More complex models can integrate more than one omic data type, such as DNA-seq data with RNA-seq data; in this case, the goal is to predict gene expression levels from the genes' DNA sequence.
In this thesis work, for each selected model, the expected behaviour was first reproduced under the same experimental conditions described in the original works. The performance of the models was then compared and discussed on purpose-built benchmark datasets. Since these models have specific GPU requirements, the tests were performed using the HPC resources of the Leonardo cluster provided by Cineca, within a dedicated project. First, models capable of predicting various genomic features from DNA were tested. Then, since in most clinical contexts methylation data are less available than DNA-seq data, the main focus of the thesis became the analysis of models that predict methylation from DNA, with the aim of understanding their architecture and limitations. After these models had been tested on state-of-the-art benchmark datasets, a new dataset was carefully selected and built with genomic, transcriptomic and methylation data from colorectal cancer (CRC) samples associated with the CpG Island Methylator Phenotype (CIMP), retrieved from the TCGA public database. The aim was to show that these models, despite working well in “healthy” conditions, cannot reliably predict methylation in a pathological context, because they take no input other than the target DNA sequence. The exploratory analysis of this dataset and the observed model performance confirmed the need to integrate additional input features. Therefore, to conclude the thesis, a possible multi-omic model architecture was proposed, aimed at methylation prediction through the integration of omics data and metadata. The results presented in this thesis could lay the foundation for the development of multi-omic generative AI models, with the aim of improving the diagnosis of genetic diseases and accelerating the development of personalized treatments based on the patient's multi-omic profile.
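
The abstract mentions that biological sequences can be handled like text through a tokenization strategy, but it does not say which scheme the tested models use. As a purely illustrative sketch, the Python snippet below assumes a fixed-length k-mer tokenization over the A/C/G/T alphabet. The function names (build_vocab, tokenize, encode), the choice of k = 3 and the reserved id 0 for unknown tokens are hypothetical and are not taken from any of the models discussed in the thesis.

```python
from itertools import product


def build_vocab(k):
    """Map every k-mer over A/C/G/T to an integer id.

    Ids start at 1; id 0 is reserved (by assumption) for k-mers that
    contain other symbols, such as the ambiguity code N.
    """
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: i for i, kmer in enumerate(kmers, start=1)}


def tokenize(seq, k=3, stride=None):
    """Split a DNA string into k-mers.

    stride=k (the default) gives non-overlapping tokens;
    stride=1 gives fully overlapping tokens.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    seq = seq.upper()
    stride = k if stride is None else stride
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(seq, vocab, k=3, stride=None):
    """Turn a DNA string into a list of integer token ids."""
    return [vocab.get(tok, 0) for tok in tokenize(seq, k, stride)]


vocab = build_vocab(3)
tokens = tokenize("ACGTAC", k=3)        # non-overlapping 3-mers
ids = encode("ACGTAC", vocab, k=3)      # their integer ids
```

The resulting list of integer ids plays the same role as word or sub-word ids in a natural language model. Whether tokens overlap, and how long they are, is a design choice that trades vocabulary size against sequence length.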
