Toward semi-automatic synset population: LLM-based approaches for Ancient Greek WordNet expansion

MARCHESI, BEATRICE
2024/2025

Abstract

The recent rise of large language models (LLMs) and their use in a wide variety of tasks has revolutionized the field of NLP, leading to great advances also in the study of ancient languages. The lexical-semantic information encoded in WordNets plays an important role in the pre-training and fine-tuning of such models, improving their performance on specific downstream tasks. However, the interchange between WordNets and LLMs has mainly been explored in one direction, with only sporadic attempts to exploit LLMs to enrich WordNet information and partially automate synset population, even though LLMs have been successfully employed to enrich other kinds of linguistic resources. This work explores the use of LLMs, specifically Mistral-Nemo, in the semi-automatic population of Ancient Greek WordNet synsets through a synonym-generation task. The experiment proceeds in successive steps, exploring increasingly complex approaches: zero-shot prompting, few-shot prompting, and fine-tuning. The dataset used for fine-tuning has the same structure and format as the data collected in the WordNet, since the experiment tests the benefits of a feedback loop in which WordNet structured data is used to generate new data of the same type. The results are evaluated against an English baseline to highlight differences in performance between a high-resource modern language and Ancient Greek, a low-resource historical language. The zero-shot approach yields the highest accuracy, while fine-tuning produces the largest number of potential synonyms. The analysis also reveals that polysemy and part of speech (PoS) affect the model's performance, as the highest scores are registered for polysemous words and for verbs and nouns. These outcomes are encouraging for the application of such approaches in a human-in-the-loop scenario, since human validation still proves crucial to ensuring the quality and accuracy of the results.
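To make the zero-shot setup described in the abstract concrete, the following is a minimal sketch of how a synonym-generation prompt for one Ancient Greek synset might be built and its output parsed for human validation. The lemma, gloss, prompt wording, and parsing logic here are illustrative placeholders, not the prompts or code actually used in the thesis.

```python
# Hypothetical sketch of a zero-shot synonym-generation step for one synset.
# The prompt text, example lemma, and gloss are illustrative, not the
# thesis's actual materials; the model call itself is omitted.

def build_zero_shot_prompt(lemma: str, pos: str, gloss: str) -> str:
    """Compose a prompt asking an LLM for Ancient Greek synonyms of one sense."""
    return (
        "You are an expert in Ancient Greek lexicography.\n"
        f"Given the {pos} '{lemma}' with the sense: '{gloss}',\n"
        "list Ancient Greek synonyms for this sense, one per line.\n"
        "Output only the lemmas, with no commentary."
    )

def parse_synonyms(model_output: str) -> list[str]:
    """Split the model's reply into candidate lemmas for human validation."""
    return [line.strip() for line in model_output.splitlines() if line.strip()]

# Example usage with a hypothetical synset entry and a mocked model reply:
prompt = build_zero_shot_prompt("λόγος", "noun", "a spoken or written account")
candidates = parse_synonyms("μῦθος\nἔπος\n")
```

In a human-in-the-loop workflow of the kind the abstract recommends, the `candidates` list would be reviewed by an expert before any lemma is added to a synset.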
Files in this record:
Marchesi - tesi magistrale.pdf (open access, 1.46 MB, Adobe PDF)
Description: Master's thesis - Marchesi Beatrice
Users may download and share the full-text documents available in UNITESI UNIPV under the terms of a Creative Commons CC BY NC ND license.
For more information and to check file availability, write to: unitesi@unipv.it.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14239/30567