Diseases are often umbrella terms for several subcategories of illness, whose identification is becoming increasingly important, especially for personalized treatments that are better focused on individual patients. The main idea of this research was therefore to explore combining unsupervised learning, to identify potential subclasses, with supervised learning, to build models that better predict a number of different health outcomes. We have analysed patients affected by systemic sclerosis (SSc), a rare and heterogeneous disease characterized by a complex interplay of vascular injury, immune system activation and fibrosis, with the highest disease-related mortality rate among rheumatologic disorders. Although different classification systems have been proposed, the most widely accepted clinical method of subdividing SSc is by the pattern of skin involvement. This identifies two main subsets: limited cutaneous SSc, where skin thickening affects only areas distal to the elbows and knees, and diffuse cutaneous SSc, where skin involvement can affect the whole body. These two subsets, however, are unlikely to be the only subcategories. Even within these pathologically defined patient groups, there remains wide variability in presentation, severity, treatment response and prognosis. In fact, these differences in clinical phenotype may represent distinct systemic sclerosis subclasses, and discovering them is essential for more informed diagnoses. Thanks to a partnership between the Computer Science Department at Brunel University London and the Centre for Rheumatology and Connective Tissue Diseases (CTDs) at the Royal Free London Hospital, it was possible to work on a dataset of 625 systemic sclerosis subjects with disease onset between January 1995 and December 2003, followed for up to fifteen years.
Patients underwent blood and lung function tests, and were assessed for the burden of clinically significant organ involvement and for survival. Based on these clinical data, we have explored a number of different experimental architectures that use unsupervised methods to group patients into distinct cohorts, in order to identify variations in symptoms and recognize potential disease subclasses. The best results in terms of weighted kappa and silhouette index were achieved with a consensus clustering technique applied in two distinct ways. We then used these results as a new feature to better predict a number of different internal organ complications, estimating whether they are likely to occur before or after a specific temporal threshold. To do so, we performed k-fold cross-validation using several supervised classifiers from different areas of machine learning, including an algorithm for rule induction. Furthermore, we carried out statistical and survival analyses to show that patients in different disease subcategories have a different risk of death or of major organ involvement. Finally, we have shown not only that subcategory discovery improves prediction, significantly though not substantially, but also that the discovered rules incorporate subcategory information that can be directly interpreted to better understand the meaning of the new subclasses of systemic sclerosis.
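The idea of combining several clusterings into a consensus, scoring it with the silhouette index and then reusing the resulting label as an extra feature can be illustrated with a minimal sketch. It is not the technique used in this work: the toy points, the three base labelings, the 0.5 threshold and the function names `co_association`, `consensus_labels` and `silhouette` are all assumptions made for illustration, and the consensus step here is a simple thresholded co-association grouping. Only the Python standard library is used.

```python
from math import dist

def co_association(labelings):
    """Fraction of base clusterings that place each pair of items together."""
    n = len(labelings[0])
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(labelings)
    return m

def consensus_labels(coassoc, threshold=0.5):
    """Connected components of the graph linking pairs above the threshold."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def silhouette(points, labels):
    """Mean silhouette width with Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)                                 # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated groups of "patients".
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]

# Three hypothetical base clusterings; the third misplaces one item.
base = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]

coassoc = co_association(base)
consensus = consensus_labels(coassoc)
score = silhouette(points, consensus)

# The consensus label becomes an extra feature for a downstream classifier.
augmented = [p + (c,) for p, c in zip(points, consensus)]
```

In this toy case the disagreement in the third labeling is outvoted, so the consensus recovers the two groups; the augmented tuples could then be passed to any supervised classifier evaluated with k-fold cross-validation.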
Scoperta di sottoclassi di patologia mediante metodi di apprendimento supervisionato e non supervisionato
Discovery of disease subclasses by combining supervised and unsupervised learning
BOSONI, PIETRO
2015/2016
Users may download and share the full-text documents available in UNITESI UNIPV in accordance with the Creative Commons CC BY NC ND licence.
For more information, and to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/21926