Systems biology studies and analyzes biological systems in terms of the interactions between cellular components. Several reverse engineering methodologies have been proposed to infer the complex regulatory networks of biological systems. This thesis relies on Bayesian networks (BNs), probabilistic graphical models that describe gene expression values by means of random variables and the dependencies between them by means of conditional probabilities. Dynamic Bayesian networks (DBNs) are an extension of BNs that represents the temporal evolution of the variables and is able to model loops. All reverse engineering methodologies suffer from the limited number of experimental conditions usually available and from measurement noise. Introducing prior knowledge into the network learning process can, however, improve the accuracy of the inferred models. The aim of this work was to develop a methodology that integrates prior knowledge into the learning of DBNs from temporal expression data. The methodology was validated both on gold-standard networks, for which the true underlying biological regulations are known, and on a high-resolution dataset describing gene expression patterns during heart development. Learning of the DBN is addressed as a model selection problem, in which network models are evaluated on the basis of their posterior probability given the measured expression data. As a search strategy, this work relies on the MCMC Metropolis-Hastings algorithm. This iterative stochastic procedure builds a chain of networks that, under fairly general regularity assumptions, converges to the posterior distribution, thus capturing the intrinsic uncertainty of the inference. The sampled networks can then be used to estimate the marginal posterior probabilities of the edges, which can be summarized into a gene network.
In the BN framework, prior knowledge can be introduced in a principled way by specifying prior probabilities for the examined models. Here, STRING was chosen as the source of prior knowledge. STRING is a database of known and predicted protein interactions that combines several resources, and it is therefore currently one of the most comprehensive sources of available interactions. Moreover, it provides a confidence score for each interaction, derived from the integration of different types of evidence. A strategy was therefore devised to transform the selected knowledge into prior probabilities of gene network edges and to implement it in the learning algorithm. Gold-standard datasets, taken from past DREAM challenges, were used to validate the methodology. DREAM is a collaborative scientific effort that encourages discussion and improvement of reverse engineering methodologies through annual challenges. The results show that including prior knowledge improves the network learning process. The scalability of the methodology with an increasing number of variables was also analyzed, and convergence was found to worsen because of the exponential rise in complexity and computational time. The resulting methodology was subsequently applied to a heart development dataset profiling gene expression from the embryonic to the postnatal state at high resolution. In this case, a “meta-gene” network was learned: each network variable no longer represents the expression of a single gene, but that of a gene cluster. Thanks to this network, novel biological hypotheses could be formulated, which remain to be experimentally validated.
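The sampling scheme described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the function name `mh_edge_marginals`, the single-edge toggle proposal, and the toy score `toy_log_score` are all assumptions made for illustration. In a real application the score would be the log posterior of a DBN structure given the expression time series.

```python
import math
import random

def mh_edge_marginals(n_genes, log_score, n_iter=20000, burn_in=2000, seed=0):
    """Sample DBN structures with Metropolis-Hastings; return edge frequencies.

    A structure is a set of (parent, child) pairs, read as "parent at time t
    regulates child at time t+1". Because every edge crosses a time slice,
    self-loops and feedback cycles between genes are admissible.
    """
    rng = random.Random(seed)
    current = set()                      # start from the empty network
    current_score = log_score(current)
    counts = {}
    kept = 0
    for it in range(n_iter):
        # Symmetric proposal (assumed): toggle one randomly chosen edge.
        edge = (rng.randrange(n_genes), rng.randrange(n_genes))
        proposal = current ^ {edge}
        proposal_score = log_score(proposal)
        # Accept with probability min(1, posterior ratio).
        log_ratio = proposal_score - current_score
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current, current_score = proposal, proposal_score
        # After burn-in, count how often each edge is present in the chain.
        if it >= burn_in:
            kept += 1
            for e in current:
                counts[e] = counts.get(e, 0) + 1
    return {e: c / kept for e, c in counts.items()}

# Hypothetical stand-in for a log posterior: each edge contributes
# independent log-odds, so the exact edge marginals are known.
TOY_LOG_ODDS = {(0, 1): 3.0, (1, 0): 3.0, (1, 2): -3.0}

def toy_log_score(edges):
    # Edges not listed above get log-odds of -1.0.
    return sum(TOY_LOG_ODDS.get(e, -1.0) for e in edges)

marginals = mh_edge_marginals(3, toy_log_score)
```

Under this toy score the exact marginal of an edge is the logistic function of its log-odds, so the estimates for (0, 1) and (1, 0) should approach roughly 0.95 as the chain lengthens. These two edges form a feedback loop between genes 0 and 1, which a static BN could not represent. Thresholding the estimated marginals is one simple way to summarize the samples into a single network.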
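The abstract does not state how STRING confidence scores were turned into edge priors, so the following is only a sketch under stated assumptions. It assumes scores on a 0 to 1000 scale, a linear mapping from an uninformative baseline of 0.5 up to a cap below 1, independent edges, and the same score applied to both directions of an undirected STRING interaction. The names `edge_prior` and `log_structure_prior` are hypothetical.

```python
import math

def edge_prior(string_score, baseline=0.5, max_score=1000):
    """Map a STRING-style confidence score to a prior edge probability.

    Hypothetical mapping: no evidence (None) yields the uninformative
    baseline; otherwise the prior rises linearly from the baseline.
    """
    if string_score is None:
        return baseline
    s = min(max(string_score / max_score, 0.0), 1.0)
    p = baseline + (1.0 - baseline) * s
    return min(p, 0.999)                 # keep log(1 - p) finite

def log_structure_prior(edges, genes, scores):
    """Log prior of a structure, assuming independent edge priors.

    `scores` maps an unordered gene pair (frozenset) to its confidence score.
    """
    total = 0.0
    for parent in genes:
        for child in genes:
            p = edge_prior(scores.get(frozenset((parent, child))))
            if (parent, child) in edges:
                total += math.log(p)
            else:
                total += math.log(1.0 - p)
    return total

# Illustrative values only, not taken from STRING.
genes = ["A", "B", "C"]
scores = {frozenset(("A", "B")): 900}
supported = log_structure_prior({("A", "B")}, genes, scores)
unsupported = log_structure_prior({("A", "C")}, genes, scores)
empty = log_structure_prior(set(), genes, scores)
```

Adding this log prior to the log likelihood of the data gives the log posterior, up to a constant, that a structure sampler can use as its score. With the baseline at 0.5, an edge without evidence leaves the prior unchanged, whereas an edge backed by a high score raises it.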
Systems biology is concerned with studying and analyzing biological systems in terms of the interactions between cellular components. Different reverse engineering methodologies have been proposed to infer the complex regulatory network of biological systems. This thesis is based on the use of Bayesian networks (BNs), probabilistic graphical models that describe gene expression values through random variables and the dependencies between them by means of conditional probabilities. Dynamic Bayesian networks (DBNs) are an extension of BNs that describes the temporal evolution of the variables and is able to model cycles. All reverse engineering methodologies are negatively affected by the small number of experimental conditions usually available and by the intrinsic noise in the measurements. For this reason, introducing prior knowledge into the network learning process can improve the accuracy of the learned models. In this work, a methodology was developed that integrates prior knowledge into the learning of DBNs from gene expression time series. The methodology was validated both on gold-standard data, for which the underlying biological regulations are known, and on high-resolution experimental data of gene expression during heart development. Learning a DBN can be addressed as a model selection problem, evaluating the networks on the basis of their posterior probability given the experimental data. As a search strategy, this work uses the MCMC Metropolis-Hastings algorithm. This iterative stochastic procedure builds a chain of networks that, under reasonable regularity assumptions, converges to the posterior probability distribution, thereby capturing the intrinsic uncertainty of the inference.
The sampled networks can be used to estimate the marginal posterior probabilities of the edges, which can then be summarized into a gene network. The BN framework allows prior knowledge to be introduced in a “well-founded” manner, by exploiting the possibility of specifying prior probabilities for the examined models. STRING, a database of known and predicted interactions between proteins, was chosen as the source of prior knowledge. STRING combines several resources and is therefore currently one of the most complete references for the available interactions. In addition, for each relationship it provides a confidence score derived from the integration of the different types of evidence. It was therefore necessary to devise a strategy for transforming the selected knowledge into prior probabilities to be assigned to the edges of the gene network, and to implement it in the learning algorithm. Gold-standard datasets, available from concluded DREAM challenges, were used to validate the methodology. DREAM is a community scientific effort to encourage discussion and improvement of reverse engineering methodologies by means of annual challenges. The results show that introducing prior knowledge improves the learning of gene networks. The scalability of the methodology as the number of variables increases was also analyzed, and a worsening of convergence was observed, caused by the exponential increase in complexity and computational time. The developed methodology was then applied to data on heart development, representing gene expression from the embryonic to the postnatal state. In this case, a network of “meta-genes” was learned, that is, a network in which each variable is no longer a single gene but a cluster of genes. Thanks to this network, new biological hypotheses were generated, to be validated experimentally.
Combining prior knowledge and gene expression data to learn dynamic Bayesian networks: an application to heart development.
LIPPOLIS, ELEONORA
2014/2015
Users may download and share the full-text documents available in UNITESI UNIPV in accordance with the Creative Commons CC BY NC ND license.
For further information, or to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/17967