Leveraging AWS Glue to Implement Spark-based ETL Jobs for a Data Warehousing Solution

GENTILINI, SERGIO

2020/2021

Abstract

The COVID-19 pandemic was an unexpected calamity for the global population between 2019 and 2021: governments were unprepared to tackle the effects and consequences of an epidemic of global scale, healthcare systems could not sustain the impact because of a shortage of hospital beds and ICUs, economies shifted, and ordinary people had to adapt to a new way of living. It soon became apparent that governments needed information on which to base their decisions about how best to contain the spread of the virus and limit the detrimental effects of the outbreak. Epidemics of similar impact have been very rare in recent centuries, so recorded data that could aid administrative bodies are not easily retrievable. The PERISCOPE Project aims to provide such information, suitably represented through analyses, predictive models, and a WebGIS Atlas, in order to understand the dynamics and consequences of the pandemic and to guide policy makers in their actions.

This thesis focuses on a crucial segment of the design envisioned by the PERISCOPE team: ETL (extract, transform, load). The data are extracted from heterogeneous sources, processed with the distributed analytics engine Apache Spark on Amazon's cloud ETL service AWS Glue, and loaded into a suitable database for the analysis components to rely on. A data warehouse implementation is introduced to support ETL and PERISCOPE's analytics purposes.

The dissertation begins with a description of the effects of the COVID-19 pandemic and of PERISCOPE's goals. The next part introduces various state-of-the-art solutions to the issues tackled by PERISCOPE, including effective COVID-19 data collection, distributed data storage and processing, and atlas modelling. The methodology chapter provides a thorough description of the data sources employed in the project and of the data warehouse architecture, and introduces the basic structure of the AWS Glue ETL processes. The fourth chapter is dedicated to technology: it defines the tools, components, and means employed to support PERISCOPE, including cloud storage, cloud development tools, and the DBMS. The following chapter describes the solutions adopted to implement the ETL jobs and import information into the Atlas, with a focus on database changes and data transformation processes. Finally, the conclusions are laid out, together with an outlook on future developments.

Record

	Faculty/Department
	
				DIPARTIMENTO DI INGEGNERIA INDUSTRIALE E DELL'INFORMAZIONE
			
	Degree programme
	
				COMPUTER ENGINEERING [06415]
			
	Academic year
	
				2020
			
	English title
	
				Leveraging AWS Glue to Implement Spark-based ETL Jobs for a Data Warehousing Solution
			
	Supervisor
	
				NOCERA, ANTONINO
				LARIZZA, CRISTIANA
			
	Appears in the following types:
	
				Master's degree theses

Files in this item:

There are no files associated with this item.

Users may download and share the full-text documents available in UNITESI UNIPV in accordance with the Creative Commons CC BY NC ND licence.
For further information, and to check whether the file is available, write to: unitesi@unipv.it.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14239/13767