In this work, we introduce a new distance, the Gene Mover's Distance (GMD), for classifying human cells. The distance is computed by solving an optimization problem on gene expression data obtained from single-cell RNA-Seq experiments. The underlying idea is to interpret the gene expression profile of each cell as a discrete probability measure, that is, a histogram of total mass 1. The support of this measure is the set of genes of the human genome, which we represent in $\mathbb{R}^{200}$ using an embedding recently introduced in the literature. We obtain a distance between cells by solving an optimal transport problem, namely the Wasserstein distance of order 2, between the histograms of each pair of cells. To evaluate GMD, we run several instances of a $k$-Nearest Neighbor classifier and assess the reliability of the metric. We test the metric on a dataset of single-cell RNA-Seq experiments from patients affected by Acute Myeloid Leukemia (AML) and from normal (healthy) patients. In our experiments, GMD is competitive with other commonly used metrics, namely the Euclidean distance and the Pearson distance, the metric derived from the well-known Pearson correlation coefficient, and it often outperforms both.
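The construction described in the abstract can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the function name `gene_movers_distance`, the dense linear-programming formulation, and the choice of squared Euclidean distance between gene embeddings as ground cost are all assumptions made for illustration, and the formulation is only practical for a small number of genes.

```python
import numpy as np
from scipy.optimize import linprog


def gene_movers_distance(a, b, embedding):
    """Order-2 Wasserstein distance between two expression histograms.

    Hypothetical helper, for illustration only.
    a, b      : non-negative expression vectors of length n (one entry per gene)
    embedding : (n, d) array with one row per gene (d = 200 in the abstract)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    embedding = np.asarray(embedding, dtype=float)
    # Normalise each profile to total mass 1 (a discrete probability measure).
    a = a / a.sum()
    b = b / b.sum()
    n = a.size

    # Assumed ground cost: squared Euclidean distance between gene embeddings.
    diff = embedding[:, None, :] - embedding[None, :, :]
    cost = (diff ** 2).sum(axis=2)

    # The transport plan P is flattened row-major: P[i, j] -> x[i * n + j].
    # Row sums of P must equal a, column sums must equal b.
    row_sums = np.kron(np.eye(n), np.ones((1, n)))
    col_sums = np.kron(np.ones((1, n)), np.eye(n))
    A_eq = np.vstack([row_sums, col_sums])
    b_eq = np.concatenate([a, b])

    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    # The order-2 distance is the square root of the optimal transport cost.
    return float(np.sqrt(max(res.fun, 0.0)))


# Toy example: three "genes" embedded on a line at positions 0, 1 and 2.
toy_embedding = np.array([[0.0], [1.0], [2.0]])
# All mass moves from the first gene to the third, a displacement of 2.
d_far = gene_movers_distance([1, 0, 0], [0, 0, 1], toy_embedding)
# Identical profiles need no transport.
d_same = gene_movers_distance([3, 1, 0], [3, 1, 0], toy_embedding)
print(d_far, d_same)
```

A matrix of such pairwise distances between cells could then be passed to any $k$-Nearest Neighbor classifier that accepts precomputed distances.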
In this work, we introduce a new distance, the Gene Mover's Distance (GMD), with the aim of classifying human cells. The metric solves an optimization problem using gene expression data obtained from single-cell RNA sequencing experiments. The underlying idea is to interpret the gene expression profile of each cell as a discrete probability measure, that is, a histogram of total mass 1. The support of this probability measure is the set of genes of the human genome, which we can represent in $\mathbb{R}^{200}$ thanks to an embedding recently published in the literature. The distance is obtained by solving an optimal transport problem, using the Wasserstein distance of order 2 between pairs of histograms, so that cells are compared through their associated expression profiles. To evaluate the performance obtained with GMD, we ran several $k$-Nearest Neighbor instances and assessed the reliability of the metric. We tested the metric on a dataset containing single-cell RNA-Seq experiments from patients affected by Acute Myeloid Leukemia (AML) and from normal (healthy) patients. In this setting, GMD is competitive with other commonly used metrics, such as the Euclidean distance and the Pearson distance, the metric derived from the well-known Pearson correlation coefficient. We also showed that GMD often performs better than both metrics.
Gene Mover's Distance: theory and applications
VERCESI, ELEONORA
2018/2019
Users may download and share the full-text documents available in UNITESI UNIPV in accordance with the Creative Commons CC BY NC ND licence.
For more information, or to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/11583