Improving Ancient Greek Word Embeddings through Lexical Retrofitting: A Comparison with Count-Based Models

REINA, LORENZO
2024/2025

Abstract

This study investigates alternative strategies for constructing distributional semantic models (Harris 1954; Lenci 2008) for Ancient Greek, a highly inflected historical language for which existing embeddings remain sparse and semantically inconsistent (Keersmaekers & Speelman 2024; Stopponi et al. 2023). Building on the count-based baseline developed by Keersmaekers and Speelman (2024), the project introduces a new set of dependency-based vectors trained with word2vecf (Levy & Goldberg 2014) to assess whether syntactic contexts provide a cleaner and more informative signal than traditional linear windows. The models are trained on the same corpus used in the original study, GLAUx (Keersmaekers 2023), ensuring direct comparability. Since different types of embeddings tend to exhibit distinct semantic properties (Lenci & Sahlgren 2025; Stopponi et al. 2023), the first research question (RQ1) examines whether dependency-based embeddings improve the representation of functional similarity while reducing the noise typical of window-based models (cf. Lenci & Sahlgren 2025). The second part of the project (RQ2) investigates whether injecting symbolic knowledge can further refine the semantic structure of the vectors. Given the limited coverage of the Ancient Greek WordNet (Bizzoni et al. 2014), the study applies the retrofitting technique developed by Faruqui et al. (2015), using curated semantic pairs from Marchesi (2025) as a controlled source of lexical relations to post-process the distributional vectors. Both the new dependency-based vectors and the count-based baseline are evaluated through intrinsic and extrinsic tasks. The comparison between models, before and after retrofitting, aims to clarify (i) whether dependency-based contexts systematically outperform linear contexts in a morphologically rich, word-order-flexible language, and (ii) whether symbolic constraints can compensate for data sparsity and improve the semantic coherence of the embeddings. Overall, this work aims to provide the first systematic comparison between count-based, dependency-based, and retrofitted models for Ancient Greek, and introduces a framework that can be applied to other low-resource and historical languages.
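
As an illustration of the retrofitting step referenced in the abstract, the iterative update of Faruqui et al. (2015) can be sketched in a few lines of Python. This is a minimal sketch, not the code used in the thesis: the names (vectors, pairs, retrofit, alpha, beta) are illustrative, the weights are held constant (the original formulation allows per-word and per-edge weights, e.g. a weight inversely proportional to a node's degree), and the example lemma pair is assumed rather than taken from Marchesi (2025).

from collections import defaultdict
import numpy as np

def retrofit(vectors, pairs, n_iters=10, alpha=1.0, beta=1.0):
    """Pull each vector towards its lexical neighbours while keeping it
    close to its original distributional estimate (Faruqui et al. 2015)."""
    # Build an undirected graph from the curated semantic pairs,
    # keeping only lemmas that actually have a vector.
    neighbours = defaultdict(set)
    for a, b in pairs:
        if a in vectors and b in vectors:
            neighbours[a].add(b)
            neighbours[b].add(a)

    original = {w: v.copy() for w, v in vectors.items()}   # fixed anchors
    new_vecs = {w: v.copy() for w, v in vectors.items()}   # vectors being updated

    for _ in range(n_iters):
        for word, nbrs in neighbours.items():
            # Closed-form update: weighted average of the original vector
            # and the current vectors of the word's graph neighbours.
            num = alpha * original[word] + beta * sum(new_vecs[n] for n in nbrs)
            new_vecs[word] = num / (alpha + beta * len(nbrs))
    return new_vecs

# Hypothetical usage: vectors maps lemmas to numpy arrays, pairs lists related lemmas.
# vectors = {"ἀγαθός": np.array([...]), "ἐσθλός": np.array([...]), ...}
# pairs = [("ἀγαθός", "ἐσθλός"), ...]
# retrofitted = retrofit(vectors, pairs)

Lemmas that do not appear in any pair keep their original vectors, so retrofitting only adjusts the portion of the vocabulary covered by the curated relations; this locality is what makes the technique usable when the symbolic resource, as with the Ancient Greek WordNet, has limited coverage.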

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14239/34388