Celiac disease (CD) is a chronic autoimmune disorder triggered by gluten ingestion that produces a wide spectrum of gastrointestinal and extraintestinal symptoms. Traditionally, CD has been classified into major, minor, and silent categories based on clinical presentation. However, these broad categories may fail to capture the full heterogeneity of the disease, potentially leading to misdiagnosis or delayed treatment. In this study, latent class analysis (LCA) was used to identify new CD subtypes based solely on symptomatology, independent of serological or genetic markers. Data from 2,478 adult CD patients, collected at 19 Italian centers between 2011 and 2021, were analyzed; clinical, laboratory, endoscopic, and histological data were gathered systematically to allow a comprehensive evaluation of the disease. Four distinct symptom-based clusters were identified from features including gastrointestinal, hematological, and neuropsychiatric manifestations and fatigue. The optimal number of latent classes was determined using the Bayesian Information Criterion (BIC). The newly derived classes were compared with the traditional classification using chi-squared tests. To further validate the new subtypes, supervised machine learning models were applied to assess their predictive performance against the conventional framework. The results indicated that, while the LCA-derived subtypes improved the differentiation of patient subgroups, their ability to predict disease complications remained comparable to that of the traditional classification. Although the new subtypes partly overlapped with the conventional categories, they also revealed significant discrepancies, including patients with few or no gastrointestinal symptoms who were nonetheless labeled as having “Major” disease under the traditional classification.
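The BIC-based choice of the number of latent classes described above can be illustrated with a minimal sketch. It assumes binary (present/absent) symptom indicators, so that a model with K classes and J indicators has (K - 1) + K·J free parameters. The function names and the log-likelihood values are hypothetical and chosen only for illustration; they are not the thesis's code or results.

```python
import math
from typing import Mapping


def lca_free_parameters(n_classes: int, n_binary_items: int) -> int:
    """Free parameters of an LCA with binary indicators (assumed setup):
    (K - 1) class proportions plus K * J item-response probabilities."""
    return (n_classes - 1) + n_classes * n_binary_items


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """BIC = k * ln(n) - 2 * ln(L). Lower values indicate a better trade-off
    between fit and model complexity."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_n_classes(
    log_likelihoods: Mapping[int, float], n_binary_items: int, n_obs: int
) -> int:
    """Return the number of classes whose fitted model has the lowest BIC."""
    scores = {
        k: bic(ll, lca_free_parameters(k, n_binary_items), n_obs)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get)


# Invented maximized log-likelihoods for models with 1 to 4 classes
# (illustration only, not values from the study).
toy_log_likelihoods = {1: -5200.0, 2: -5000.0, 3: -4950.0, 4: -4945.0}
best_k = select_n_classes(toy_log_likelihoods, n_binary_items=6, n_obs=1000)
print(best_k)
```

In this toy example the fourth class improves the log-likelihood only marginally, so the complexity penalty k·ln(n) outweighs the gain and the three-class model is preferred.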
These mismatches suggest that current clinical frameworks may oversimplify CD and overlook patients who remain susceptible to complications despite minimal symptoms, underscoring the need for more robust classification methods to improve diagnostic accuracy and clinical management. While data-driven approaches such as LCA provide valuable insight into CD heterogeneity, integrating additional factors, such as genetic predisposition, serological markers, and environmental influences, could further refine disease classification and improve complication prediction. In the supervised comparison, the traditional classification sometimes achieved slightly higher sensitivity for specific outcomes, whereas the LCA-based approach showed consistently better discriminative ability, as evidenced by a higher ROC AUC. Moreover, the LCA classes highlighted associations between clinical features, histological severity, and autoimmune comorbidities that were less evident under the conventional scheme. Although these results may not represent a major paradigm shift, they support the case for more refined, symptom-centric classification methods. The retrospective design and single-country sample limit generalizability, and further prospective, multicenter studies are recommended. Nevertheless, integrating LCA with advanced supervised techniques offers a promising pathway toward enhanced diagnostic precision, more targeted treatment, and a deeper understanding of CD’s heterogeneous clinical spectrum.
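The ROC AUC used above to compare discriminative ability can be read as the probability that a randomly chosen patient with a complication receives a higher predicted score than a randomly chosen patient without one, with ties counted as one half. The sketch below computes it directly from that definition; the function name, labels, and scores are hypothetical toy values, not data from the study, and the thesis may have used a library implementation instead.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive and negative cases.

    labels: 1 = outcome present (e.g. a complication), 0 = outcome absent.
    scores: model output, where higher means the outcome is judged more likely.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: four patients, two with the outcome.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 3 of 4 pairs ranked correctly: 0.75
```

Because it depends only on how cases are ranked, this measure does not require choosing a decision threshold, which is why one scheme can have a higher AUC while another reaches higher sensitivity at a particular cut-off.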

Identificazione di Nuovi Sottotipi Clinici della Malattia Celiaca Utilizzando una Combinazione di Metodi di Apprendimento Automatico Non Supervisionato e Supervisionato.

ASADOLLAHZADEH, AMIRHOSSEIN
2023/2024

2023
Identifying New Clinical Subtypes of Celiac Disease Using a Combination of Unsupervised and Supervised Machine Learning Methods.
Files in this item:
Final_thesis.pdf (open access, Adobe PDF, 1.58 MB)

Users may download and share the full-text documents available in UNITESI UNIPV in accordance with the Creative Commons CC BY NC ND license.
For more information and to check whether the file is available, write to: unitesi@unipv.it.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14239/33377