Amid the growing adoption of data-driven architectures and the need to process increasingly large, variable, and complex data flows, this thesis systematically analyzes the behavior of a streaming pipeline built on Spark Structured Streaming, Azure Event Hubs, and Azure Data Lake Storage Gen2, comparing two data storage formats: Delta Lake and Parquet. The pipeline follows the Medallion Architecture (Bronze, Silver, Gold, and GoldML) and processes a simulated stream derived from the IEEE-CIS Fraud Detection dataset, covering micro-batch ingestion, cleaning, feature reduction via PCA, hybrid stream-batch joins, and inference with an MLP model. The objective is twofold: to evaluate the actual behavior of the two formats under increasing volumes, and to define criteria for choosing the most appropriate format and configuration. The experimental analysis unfolds in two phases: (1) a baseline scenario, in which the two formats are evaluated under increasing volumes and identical operating conditions; (2) an optimized scenario, which measures the impact of critical pipeline parameters (shuffle partitions, trigger interval, Delta Optimize/Auto Compact). For each run, metrics were collected per layer. The results show that Delta Lake is better suited to dynamic scenarios involving iterative joins, deduplication, and incremental updates, guaranteeing greater stability and increasingly better performance as volume grows. Parquet is confirmed as a lighter, more storage-efficient format, ideal for read-heavy, archival, and interoperability scenarios. The analysis also shows how the pipeline configuration can significantly change the behavior of the two formats.
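The optimized scenario tunes shuffle partitions, the micro-batch trigger interval, and Delta's compaction features. As a minimal sketch of such a configuration sweep (the setting keys are real Spark/Databricks Delta options, but the values and the volume threshold are illustrative assumptions, not the thesis's measured choices):

```python
# Illustrative configuration grid for the baseline vs. optimized scenarios.
# The keys are real Spark/Delta settings; the values are placeholders, not
# the thesis's tuned numbers. The trigger interval is not a Spark conf key:
# in Structured Streaming it is passed via .trigger(processingTime=...), so
# it is kept here only as a plain entry of the sweep.
BASELINE = {
    "spark.sql.shuffle.partitions": "200",                 # Spark default
    "trigger.processingTime": "30 seconds",                # micro-batch trigger
    "spark.databricks.delta.autoCompact.enabled": "false",
    "spark.databricks.delta.optimizeWrite.enabled": "false",
}

def optimized_config(rows_per_minute: int) -> dict:
    """Return a tuned variant: fewer shuffle partitions for small volumes,
    more for large ones, with Delta's write optimizations switched on."""
    cfg = dict(BASELINE)
    cfg["spark.sql.shuffle.partitions"] = (
        "32" if rows_per_minute < 100_000 else "128"
    )
    cfg["spark.databricks.delta.autoCompact.enabled"] = "true"
    cfg["spark.databricks.delta.optimizeWrite.enabled"] = "true"
    return cfg
```

The threshold of 100,000 rows per minute is purely a placeholder; the point is that the shuffle fan-out and Delta's auto-compaction are the levers varied per run, while the baseline keeps Spark's defaults.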
The thesis proposes a decision matrix that links the pipeline's operational objectives (latency, throughput, file quality, and storage usage) to the most suitable configuration for different input formats and volumes. This work offers a methodological and applied contribution to support informed decisions in real-world enterprise contexts.
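The decision matrix itself is developed in the thesis; as a toy illustration of the kind of rule it encodes (the workload labels and the mapping below are paraphrased from the conclusions stated above, not taken from the actual matrix):

```python
# Toy lookup reflecting the conclusions summarized in this abstract:
# Delta Lake for dynamic workloads (iterative joins, deduplication,
# incremental updates); Parquet for read-heavy, archival, and
# interoperability-oriented workloads. The labels are hypothetical.
FORMAT_BY_WORKLOAD = {
    "iterative_joins": "Delta Lake",
    "deduplication": "Delta Lake",
    "incremental_updates": "Delta Lake",
    "read_heavy": "Parquet",
    "archival": "Parquet",
    "interoperability": "Parquet",
}

def recommend_format(workload: str) -> str:
    """Suggest a storage format for a workload label; reject unknown ones."""
    try:
        return FORMAT_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None
```

The real matrix additionally conditions on input volume and on the target metric (latency, throughput, file quality, storage), which a flat lookup like this deliberately omits.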
Against the backdrop of the growing spread of data-driven architectures and the need to process highly variable data flows, this thesis analyzes the behavior of a streaming pipeline based on technologies such as Spark Structured Streaming, Azure Event Hubs, and Azure Data Lake Storage Gen2, comparing two data formats: Delta Lake and Parquet. The pipeline is structured according to the Medallion architecture (Bronze, Silver, Gold, and GoldML) and processes a simulated data stream built from the IEEE-CIS Fraud Detection dataset. The pipeline integrates micro-batch ingestion, cleaning, dimensionality reduction, hybrid joins, and inference with an MLP model. The objective is twofold: to evaluate the behavior of the two formats under growing volumes, and to define criteria to guide the choice of the most suitable format and configuration. The analysis was organized in two phases: (1) a baseline scenario, which evaluates Delta Lake and Parquet under identical operating conditions; (2) an optimized scenario, which measures the impact of pipeline parameters such as shuffle partitions, trigger interval, and Delta Lake's native optimizations. For each layer and each run, metrics were collected and subsequently analyzed. The results show that Delta Lake is better suited to scenarios involving iterative joins, deduplication, and incremental updates, guaranteeing greater stability and better overall performance as the load increases. Parquet is confirmed as a lighter format that is more efficient in terms of storage used, ideal for read and archival scenarios. The analysis of the optimized scenarios shows how pipeline configurations can change the behavior of the two formats.
Finally, the thesis proposes a decision matrix that links the pipeline's operational objectives (latency, throughput, storage used, and quality of the files produced) to the configurations best suited to the input volume and format. The thesis provides an applied and methodological contribution intended to support informed choices in real-world working contexts.
Performance Evaluation and Optimization Strategies in Streaming Pipelines on Azure: A Comparison of Delta Lake (Lakehouse) and Parquet (Data Lake)
ORSI, MARIANNA
2024/2025
| File | Size | Format |
|---|---|---|
| Tesi Marianna Orsi 2.pdf (open access) | 8.11 MB | Adobe PDF |
Users may download and share the full-text documents available in UNITESI UNIPV under the terms of the Creative Commons CC BY-NC-ND license.
For more information, or to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/33640