A Dependency Treebank for Classical Sanskrit.
BIAGETTI, ERICA
2017/2018
Abstract
Universal Dependencies is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. In 2016, when the collection of treebanks already included some classical languages such as Ancient Greek, Latin, Gothic and Old Church Slavonic, a small treebank of Sanskrit was built by Dan Zeman of the Institute of Formal and Applied Linguistics (ÚFAL). The corpus is based on the Pañcatantra, a collection of interrelated animal fables in Sanskrit verse and prose. Only 230 sentences of it have been morphologically and syntactically analyzed and, before annotation continues, suitable documentation (a summary of language-specific features and of annotation solutions for phenomena that cannot be specified universally) has to be provided for Sanskrit. This thesis deals with these peculiarities, with the goal of finding consistent solutions that will allow other texts to be analyzed and the treebank to grow. One peculiarity of Sanskrit is its non-trivial word segmentation. Owing to the morphophonological changes known as sandhi, Sanskrit texts present long units which may consist of a single syntactic word or of many. This issue bears directly on syntax when these long units constitute (or contain) compound words. While the UD guidelines prescribe that compound words be regarded as single words, it is well known that Indian grammarians developed a complex classification on the basis of which elements make up the compound. Moreover, since compounding is an extremely productive phenomenon in Sanskrit, the elements of a compound can stand in any kind of syntactic relation with each other and with elements outside the compound (Gillon, 1994), thus conveying meaning which in other languages would be rendered by means of phrases or even clauses.
After a brief survey of Sanskrit compounds from a linguistic as well as a computational point of view (Lowe, 2015; Scharf, 2015), I discuss my decision to analyze every compound internally, i.e. to treat one element as the head on which the other elements depend. While this solution should also provide Sanskrit scholars with an exhaustive description, it is not free from problems, and every significant case is accounted for. Turning to morphology, further issues concern the part-of-speech tagging of some word categories such as participles, pronominal words and adverbs, since the choice of their lexical category has implications for determining their lemma and the syntactic relations they bear. Finally, some syntactic constructions are discussed, among them the preference that Sanskrit shows for sentences with no finite verb form, and relative-correlative constructions. Every treebank has a set of guidelines and a documentation page which guide other annotators in their work and enable the corpus to grow. The guidelines help to resolve language-specific issues that are not found cross-linguistically or do not appear in English, the language on which most of the examples in UD are based. In contributing to the documentation for Sanskrit, my aim is to cover as many cases as possible, so that other kinds of text can one day be added to the corpus. Finally, in order to maximize parallelism, at every stage I compare my solutions with those adopted for other classical languages.
Users may download and share the full-text documents available in UNITESI UNIPV in compliance with the Creative Commons CC BY NC ND licence.
For further information and to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/9262