VERB CLASSES IN THE TIME OF CLUSTERING: ENRICHING DISTRIBUTIONAL VECTORS WITH LINGUISTIC KNOWLEDGE FOR SEMANTIC CLASS INDUCTION
RICCHIARDI, MARTA
2021/2022
Abstract
In this thesis, we induce semantic classes of verbs using clustering. The aim is to explore the role of linguistic annotations about the syntax and semantics of verbs, and more specifically (i) whether vectors enriched with annotated syntactic-semantic features perform better in the clustering task than vectors obtained with purely distributional systems, and (ii) how much weight syntactic-semantic features such as argument structure, selectional preferences and subcategorization frames carry in the way humans outline verb classes. To explore these questions, we design a model that works in two phases, preprocessing and clustering: first, vectors are created through different strategies; then, the vectors are clustered with k-means. The baseline uses FastText embeddings, which rely on distribution only. The other strategies aim at enriching vectors with syntactic and semantic features from T-PAS, a resource for Italian verbs. The first preprocessing strategy extracts BERT embeddings from the T-PAS annotated corpus. The second strategy uses the T-PAS annotations alone, encoded into vectors through rule-based algorithms without further distributional support. The results are evaluated against a benchmark derived from a verb clustering experiment performed by humans. The first preprocessing strategy performs best, revealing the power of transformers fine-tuned on linguistically annotated corpora, and both the baseline and the first preprocessing strategy outperform the second preprocessing strategy. This result highlights the power of distributional systems when compared with systems that rely on annotations only, but the low performance of the annotation-only systems may also lead to the conclusion that syntactic and semantic features do not weigh much in the way humans outline verb classes.
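The clustering phase and its evaluation against human-made classes can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the two-dimensional toy vectors stand in for the FastText, BERT or T-PAS-derived vectors, the gold class labels are invented, and cluster purity is used as the score only because the abstract does not name the evaluation metric. It is not the implementation used in the thesis.

```python
import math
import random
from collections import Counter


def kmeans(vectors, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(n_iter):
        # Assign each vector to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the algorithm has converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def purity(assignment, gold):
    """Share of items that belong to the majority gold class of their cluster."""
    clusters = {}
    for a, g in zip(assignment, gold):
        clusters.setdefault(a, []).append(g)
    majority_total = sum(Counter(c).most_common(1)[0][1] for c in clusters.values())
    return majority_total / len(gold)


# Toy data (hypothetical): four Italian verbs, made-up vectors and gold classes.
verbs = ["mangiare", "bere", "correre", "camminare"]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
gold = ["ingestion", "ingestion", "motion", "motion"]

clusters = kmeans(vectors, k=2)
score = purity(clusters, gold)
```

Under these assumptions, comparing preprocessing strategies amounts to running the same `kmeans` and `purity` calls on each strategy's vectors for the same verbs and comparing the resulting scores.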
However, further data analysis reveals that the features do not correlate with the classes of the evaluation dataset, showing that the methodology used by the authors of the experiments does not give a satisfactory account of semantic compositionality in verb meaning.
Users may download and share the full-text documents available in UNITESI UNIPV in compliance with the Creative Commons CC BY NC ND license.
For further information, or to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/2418