The thesis work carried out at the "Mario Stefanelli" Laboratory of Biomedical Informatics is part of the 4CE Consortium (Consortium for Clinical Characterization of COVID-19 by EHR), a voluntary international organization promoted by members of the i2b2 research community. The consortium was founded in March 2020, during the first phase of the global COVID-19 pandemic, with the goal of providing clinicians, epidemiologists, and public health decision makers with evidence and results on the course of the disease, through a shared model for managing data extracted from electronic health records (EHR) within distributed research networks (DRN). The work presented in this thesis focuses on the analysis of COVID-19 sequelae in the period following the acute phase. For many subjects, recovery from the acute phase of infection with SARS-CoV-2, the coronavirus that causes the disease, can involve destabilizing and long-lasting consequences. Examples include physical symptoms (fatigue, dyspnea, chest pain, cough) and neurological symptoms (reduced concentration and memory) that may appear during or after the acute phase and persist for weeks or even months. Together these symptoms are referred to as PASC (Post-Acute Sequelae of COVID-19) or post-COVID syndrome, and they can present differently depending on the characteristics of the subject (age, sex, clinical history). Accurate identification of their phenotypes and associated risk factors from the data collected in the EHR is therefore essential to guide future research and to help health systems concentrate their efforts and resources on monitoring COVID-19 sequelae specific to sex, age, and pre-existing chronic conditions. My research focuses on two goals.
The first goal concerns the definition of post-COVID syndrome phenotypes through clinical markers available in the EHR (diagnoses, laboratory tests, medications, procedures). This goal was achieved by developing a phenotyping framework that, starting from a list of phenotypes identified and partially defined by clinicians, selects the most prevalent ones and runs a phenome-wide association study (PheWAS) on them by applying a machine learning algorithm known as MLHO. With this algorithm it is possible to include in the phenotype definitions their temporal dimension, related to disease progression, by extracting patterns that identify temporal phenotypes. The output makes it possible to identify new features that, once evaluated by clinicians, can be added to the revised phenotype definitions. The idea is to apply this process iteratively until no relevant clinical markers remain for redefining the phenotype. This first work was implemented in an R package, which was then applied to data from several hospitals in the 4CE Consortium to obtain reliable and generalizable results. The second goal concerns the development of a predictive model for these phenotypes by identifying risk factors for PASC in the EHR data. Two approaches were applied. The first applies the pipeline just described with a different observation time span: whereas phenotyping uses only data from the post-acute phase, here the input observations come from the acute and pre-acute phases, when available.
The second approach aims to identify predictive patterns using a temporal sequential pattern mining methodology (tSPM) based on the cSPADE and APRIORI algorithms.
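To make the idea of mining temporal sequences from EHR data more concrete, the following is a minimal Python sketch of one simplified step: turning each patient's time-stamped records into ordered pairs of clinical codes (earlier code, later code) and keeping the pairs seen in enough patients. It is an illustration only, not the thesis implementation, and it does not reproduce tSPM, cSPADE, or APRIORI. The function name `mine_ordered_pairs`, the record layout, the `min_support` threshold, and the toy records are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def mine_ordered_pairs(records, min_support=2):
    """Count ordered event pairs (earlier -> later) across patients.

    records: iterable of (patient_id, day, code) tuples. This layout and
    every name used here are hypothetical, not taken from the thesis.
    Returns {(earlier_code, later_code): n_patients} for pairs observed
    in at least `min_support` distinct patients.
    """
    # Group each patient's events so they can be sorted chronologically.
    by_patient = {}
    for patient_id, day, code in records:
        by_patient.setdefault(patient_id, []).append((day, code))

    support = Counter()
    for events in by_patient.values():
        events.sort()
        pairs = set()
        # combinations() keeps the sorted order, so `a` never follows `b`.
        for (day_a, a), (day_b, b) in combinations(events, 2):
            if day_a < day_b and a != b:
                pairs.add((a, b))
        # Each patient contributes at most once to a pair's support.
        support.update(pairs)

    return {pair: n for pair, n in support.items() if n >= min_support}


# Invented toy records: (patient, days since COVID-19 diagnosis, code).
records = [
    ("p1", 0, "covid_dx"), ("p1", 40, "fatigue"), ("p1", 95, "dyspnea"),
    ("p2", 0, "covid_dx"), ("p2", 60, "fatigue"),
    ("p3", 0, "covid_dx"), ("p3", 30, "dyspnea"),
]
frequent = mine_ordered_pairs(records, min_support=2)
```

In this sketch, support is counted per patient rather than per occurrence, so a pair repeated many times in one record does not dominate. A real pipeline would typically also constrain the time gap between the two events and extend pairs to longer sequences.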
Machine Learning framework to define and predict temporal phenotypes in Post-Acute Sequelae of COVID-19 (PASC)
MESA, REBECCA
2020/2021
Abstract
The thesis work carried out at the Laboratory of Biomedical Informatics "Mario Stefanelli" is part of the 4CE Consortium (Consortium for Clinical Characterization of COVID-19 by EHR), a voluntary international organization promoted by members of the i2b2 (Informatics for Integrating Biology and the Bedside) research community. The consortium was founded in March 2020, during the first phase of the global pandemic caused by COVID-19 (Coronavirus Disease 2019), to provide clinicians, epidemiologists, and public health decision makers with evidence and results about the course of the disease, through a shared model for managing Electronic Health Record (EHR) data within a Distributed Research Network (DRN). The work presented in this thesis focuses on the analysis of the sequelae of COVID-19 in the period after the acute phase. For many subjects, recovery from the acute phase of infection with SARS-CoV-2, the coronavirus that causes the disease, can be grueling and leave lingering effects. Examples include physical symptoms (e.g., fatigue, dyspnea, chest pain, cough) and neurological symptoms (e.g., impaired concentration and memory) that may appear during or after the acute phase and last for weeks or even months. Together these symptoms are referred to as PASC (Post-Acute Sequelae of COVID-19) or post-COVID syndrome, and they can manifest differently depending on the characteristics of the subject (age, sex, clinical history). Accurate identification of PASC phenotypes and their associated risk factors from EHR data is therefore important to guide future research and to help health systems focus their efforts and resources on monitoring COVID-19 sequelae specific to age, sex, and clinical history. My research has two goals.
The first goal concerns the definition of post-COVID syndrome phenotypes using EHR clinical markers (diagnoses, laboratory tests, medications, procedures). This goal was achieved by developing a phenotyping framework that, starting from a list of phenotypes identified and partially defined by clinicians, selects those with the highest prevalence and runs a Phenome-Wide Association Study (PheWAS) on them by means of a machine learning algorithm known as MLHO (Machine Learns Health Outcomes). The algorithm makes it possible to include in the phenotype definitions their temporal dimension, related to disease progression, by extracting patterns that identify temporal phenotypes. The output makes it possible to identify new features that, once evaluated by clinicians, can be added to the revised phenotype definitions. The idea is to apply this process iteratively until no relevant clinical markers remain for redefining the phenotypes. This first work was implemented in an R package, which was then applied to data from several 4CE sites in order to obtain reliable and generalizable results.
The second goal concerns the development of a predictive model for the phenotypes by identifying risk factors for PASC in the EHR data. Two approaches were applied. The first applies the pipeline just described with a different observation time span: whereas phenotyping uses only data from the post-acute phase, here the input observations come from the acute and pre-acute phases, when available. The second approach aims to identify predictive patterns using a temporal Sequential Pattern Mining (tSPM) methodology implemented with the cSPADE and APRIORI algorithms. This analysis was performed only on data extracted from local sites (ICS Maugeri and Policlinico San Matteo).
Users may download and share the full-text documents available in UNITESI UNIPV in compliance with the Creative Commons CC BY NC ND license.
For further information and to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/14480