Topic Modeling Applied to Lyrics: A Linguistic Analysis

This thesis, linguistic in both interest and subject matter, belongs to the field of Natural Language Processing. Through a linguistic analysis of a dataset of Italian song lyrics, it aims to extract and investigate the semantic content of the lyrics by applying a recent topic modeling technique: BERTopic, developed in 2020 by M. Grootendorst. It also examines the difficulties encountered during evaluation and suggests the possibility of discovering songs on the basis of their topic and, consequently, of generating content-based playlists. After a brief presentation of the reasons behind the choice of topic, the following chapter frames the dissertation within the field of Music Information Retrieval, of which a brief but comprehensive overview is provided. The thesis then gives a short overview of machine learning and the traditional distinction between supervised and unsupervised learning. Within unsupervised machine learning, it describes topic modeling and the techniques used for this task, focusing on BERTopic, the algorithm implemented in the present investigation. A full chapter is then devoted to the dataset and its main features. All the experiments carried out are presented before the interpretation and discussion of the results, which are analyzed in order to carry out a qualitative evaluation of the algorithm, highlighting difficulties and prospects. The last chapter is devoted to final considerations and proposes directions for future research.

This thesis, linguistic in interest and content, belongs to the field of Natural Language Processing (NLP). Through a linguistic analysis of a dataset consisting of the lyrics of Italian-language songs, it aims first of all to extract and investigate the semantic content of the lyrics by applying a topic modeling algorithm: BERTopic. It also examines the various difficulties encountered in the evaluation phase and finally suggests, at the application level, the possibility of finding songs through topic-based search and of generating content-based playlists. After a brief presentation of the reasons that led the author to choose the subject of the investigation, the following chapter places the present work within the research area of Music Information Retrieval (MIR), of which a brief but comprehensive overview is provided. An overview of machine learning (ML) follows, together with the traditional distinction between supervised and unsupervised ML. Within the framework of unsupervised ML, topic modeling is first introduced and the techniques for solving this task are then described, with particular attention to the algorithm used in the implementation, BERTopic, a topic generation algorithm developed in 2020 by Maarten Grootendorst. The next section describes, from a technical point of view, the data collection process behind the dataset, which was created ad hoc for the investigation and whose main characteristics are presented. The experimental part of the thesis is then described carefully in all its steps. The following section consists of the interpretation and discussion of the results, which are analyzed in order to evaluate the performance of the system, highlighting both the advantages and the problems encountered.
The last section of the thesis is devoted to final considerations and aims to propose possible improvements and directions for future research.