Improving Ancient Greek Word Embeddings through Lexical Retrofitting: A Comparison with Count-Based Models

REINA, LORENZO
2024/2025

Abstract

This study investigates alternative strategies for constructing distributional semantic models (Harris 1954; Lenci 2008) for Ancient Greek, a highly inflected historical language for which existing embeddings remain sparse and semantically inconsistent (Keersmaekers & Speelman 2024; Stopponi et al. 2023). Building on the count-based baseline developed by Keersmaekers and Speelman (2024), the project introduces a new set of dependency-based vectors trained with word2vecf (Levy & Goldberg 2014) to assess whether syntactic contexts provide a cleaner and more informative signal than traditional linear windows. The models are trained on the same corpus used in the original study, GLAUx (Keersmaekers 2023), ensuring direct comparability. Since different types of embeddings tend to exhibit distinct semantic properties (Lenci & Sahlgren 2025; Stopponi et al. 2023), the first research question (RQ1) examines whether dependency-based embeddings improve the representation of functional similarity while reducing the noise typical of window-based models (cf. Lenci & Sahlgren 2025). The second part of the project (RQ2) investigates whether injecting symbolic knowledge can further refine the semantic structure of the vectors. Given the limited coverage of the Ancient Greek WordNet (Bizzoni et al. 2014), the study applies the retrofitting technique developed by Faruqui et al. (2015), using curated semantic pairs from Marchesi (2025) as a controlled source of lexical relations to post-process the distributional vectors. Both the new dependency-based vectors and the count-based baseline are evaluated through intrinsic and extrinsic tasks. The comparison between models, before and after retrofitting, aims to clarify (i) whether dependency-based contexts systematically outperform linear contexts in a morphologically rich, word-order-flexible language, and (ii) whether symbolic constraints can compensate for data sparsity and improve the semantic coherence of the embeddings. Overall, this work aims to provide the first systematic comparison between count-based, dependency-based, and retrofitted models for Ancient Greek, and introduces a framework that can be applied to other low-resource and historical languages.
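
As an illustration of the retrofitting step referenced in the abstract, the iterative update of Faruqui et al. (2015) can be sketched in a few lines of Python. This is a minimal sketch, not the code used in the thesis: the names (vectors, pairs, retrofit, alpha, beta) are illustrative, the weights are held constant (the original formulation allows per-word and per-edge weights, e.g. a weight inversely proportional to a node's degree), and the example lemma pair is assumed rather than taken from Marchesi (2025).

from collections import defaultdict
import numpy as np

def retrofit(vectors, pairs, n_iters=10, alpha=1.0, beta=1.0):
    """Pull each vector towards its lexical neighbours while keeping it
    close to its original distributional estimate (Faruqui et al. 2015)."""
    # Build an undirected graph from the curated semantic pairs,
    # keeping only lemmas that actually have a vector.
    neighbours = defaultdict(set)
    for a, b in pairs:
        if a in vectors and b in vectors:
            neighbours[a].add(b)
            neighbours[b].add(a)

    original = {w: v.copy() for w, v in vectors.items()}   # fixed anchors
    new_vecs = {w: v.copy() for w, v in vectors.items()}   # vectors being updated

    for _ in range(n_iters):
        for word, nbrs in neighbours.items():
            # Closed-form update: weighted average of the original vector
            # and the current vectors of the word's graph neighbours.
            num = alpha * original[word] + beta * sum(new_vecs[n] for n in nbrs)
            new_vecs[word] = num / (alpha + beta * len(nbrs))
    return new_vecs

# Hypothetical usage: vectors maps lemmas to numpy arrays, pairs lists related lemmas.
# vectors = {"ἀγαθός": np.array([...]), "ἐσθλός": np.array([...]), ...}
# pairs = [("ἀγαθός", "ἐσθλός"), ...]
# retrofitted = retrofit(vectors, pairs)

Lemmas that do not appear in any pair keep their original vectors, so retrofitting only adjusts the portion of the vocabulary covered by the curated relations; this locality is what makes the technique usable when the symbolic resource, as with the Ancient Greek WordNet, has limited coverage.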

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14239/34388