These are the main parameters of the corpus:
|Size||19.6 thousand words (as of February 2023)|
|Texts||- sound-aligned transcriptions of recordings in Tatyshly Udmurt collected by the field team of OTIPL MSU since 2019;
- published texts in Tatyshly Udmurt|
|Language variety||dialectal Udmurt (transcriptions are close to the standard orthography; phonological transcription adopted in the field project or in the source is also provided)|
The corpus includes two kinds of Tatyshly Udmurt texts. The spoken subcorpus comprises texts recorded and transcribed in the field project of the Department of Theoretical and Applied Linguistics (OTIPL), Faculty of Philology, Lomonosov Moscow State University. The team has been working in the Tatyshly district of the Republic of Bashkortostan, Russia, primarily in the villages of Nizhnebaltachevo, Staryj Kyzyl-Jar, Ivanovka, and Novye Tatyshly. In addition, the villages of Starokalmijarovo, Verkhnebaltachevo, and Urazgildy have been visited. The team leader is Egor Kashkin. More information about the project is available on its website. The published subcorpus includes transcribed dialectal texts from earlier publications (no sound is available for them).
The transcriptions are close to the standard orthography but not identical to it. We normalize regular phonetic correspondences (e.g., the letter ы is used where Tatyshly Udmurt has a reduced vowel; the letter ӵ, which marks a non-palatalized affricate, is used where Tatyshly Udmurt has š or a palatalized affricate č'). We do not normalize correspondences in affixes or at morpheme boundaries which cannot be explained through regular phonetic processes (e.g., the affix of the elative is written as -ись / -ӥсь, cf. -ысь in Standard Udmurt; the converb from луыны can be written as луыса or луса depending on how it was pronounced). More details on developing a transcription system for Udmurt dialect corpora are available in this paper.
In addition to the text in Cyrillic, the phonological transcription adopted in the field project is provided.
A language corpus is a collection of texts in that language which has been enriched with additional linguistic information, called annotation, and, preferably, equipped with a search engine. Here you will find a short list of frequently asked questions about the corpus.
— Who needs corpora?
First of all, corpora are used by linguists. The search engine and annotation of a corpus are designed so that you can make linguistic queries such as “find all nouns in the genitive case” or “find all forms of the word коӵыш followed by a verb”. Apart from linguists, a corpus can be a useful tool for language teachers, language learners, and even native speakers.
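To give a rough idea of what such queries amount to, here is a minimal sketch in Python. It is not the corpus search engine: the `Token` class, the tag names (`N`, `V`, `gen`), and both functions are assumptions made up for this illustration, and the token sequences are toy data rather than real corpus sentences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word: surface form, lemma, part of speech, features."""
    form: str
    lemma: str
    pos: str                       # hypothetical tags: "N" noun, "V" verb
    feats: frozenset = frozenset() # hypothetical feature labels, e.g. "gen"


def find_tokens(sentences, pos=None, feat=None):
    """Return all tokens matching the given part of speech and/or feature."""
    hits = []
    for sentence in sentences:
        for tok in sentence:
            if pos is not None and tok.pos != pos:
                continue
            if feat is not None and feat not in tok.feats:
                continue
            hits.append(tok)
    return hits


def find_lemma_followed_by(sentences, lemma, next_pos):
    """Return (form, next form) pairs where a form of `lemma`
    is immediately followed by a token tagged `next_pos`."""
    hits = []
    for sentence in sentences:
        for cur, nxt in zip(sentence, sentence[1:]):
            if cur.lemma == lemma and nxt.pos == next_pos:
                hits.append((cur.form, nxt.form))
    return hits


# Toy token sequences (illustration only, not real corpus sentences).
TOY_SENTENCES = [
    [Token("коӵыш", "коӵыш", "N"), Token("карыны", "карыны", "V")],
    [Token("коӵышлэн", "коӵыш", "N", frozenset({"gen"})),
     Token("кар", "кар", "N")],
]

genitive_nouns = find_tokens(TOY_SENTENCES, pos="N", feat="gen")
cat_before_verb = find_lemma_followed_by(TOY_SENTENCES, "коӵыш", "V")
```

Here `genitive_nouns` corresponds to “find all nouns in the genitive case”, and `cat_before_verb` to “find all forms of the word коӵыш followed by a verb”. Both queries rely on the lemma and grammatical tags rather than on the surface form, which is what the annotation makes possible.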
— Can I use the corpus as a library?
No, these corpora are not designed for that. When you work with a corpus, you make a query, i.e. search for a particular word, phrase or construction, and get back all sentences that contain what you searched for. By default, the sentences are shown in random order. You can expand the context of each sentence you get, i.e. look at its neighboring sentences. However, you may do so only a limited number of times for each sentence. Therefore, it is impossible to read an entire text in the corpus. This is done for copyright protection.
— Can I use the corpus as a dictionary?
Each Udmurt word in the corpus has a Russian translation (no English translations are available at the moment). However, the translations are only provided as auxiliary information for users who do not speak Udmurt. They are kept short and simple by design: they do not list all senses and do not provide usage examples, as real dictionaries do. If you want to know how to translate a word, the right way to do so is to consult a dictionary (for an Udmurt-Russian dictionary, see e.g. here).
— What is morphological annotation and how do you get it?
The corpus located here is lemmatized and morphologically annotated. Lemmatization means that each word in the texts is annotated with its lemma, i.e. dictionary/citation form. Morphological annotation means that each word is annotated for its grammatical features, such as part of speech, number, case, tense, etc. Since the corpora in question are too large for manual annotation to be feasible, they were annotated automatically with a program called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Udmurt inflection. The analyzer together with the necessary materials is freely available in my GitHub repository. Automatic annotation unfortunately means that, first, out-of-vocabulary words are not annotated, and, second, some words have several competing analyses. For example, confronted with the form каре, the analyzer cannot determine whether it should be analyzed as 1sg possessive of кар (“my city”), illative of the same word (“to the city”) or even a form of the verb карыны “do”. Ambiguity is partly removed with manually designed context rules. Russian translations were annotated with the mystem analyzer.
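The following Python sketch illustrates, under stated assumptions, how ambiguity and context rules interact. It does not reproduce the actual analyzer or its rules: the `LEXICON` table, the tag names, the `analyze` function, and the rule in `drop_verb_before_verb` are all invented for this illustration. Only the three readings of каре come from the description above.

```python
# Hypothetical mini-lexicon: each form maps to its possible analyses.
LEXICON = {
    "каре": [
        {"lemma": "кар", "pos": "N", "gram": "poss.1sg"},       # "my city"
        {"lemma": "кар", "pos": "N", "gram": "ill"},            # "to the city"
        {"lemma": "карыны", "pos": "V", "gram": "unspecified"}, # verb form
    ],
    "карыны": [
        {"lemma": "карыны", "pos": "V", "gram": "inf"},
    ],
}


def analyze(form):
    """Return all analyses of a form; out-of-vocabulary forms get none."""
    return list(LEXICON.get(form, []))


def drop_verb_before_verb(analyses_per_token):
    """Made-up context rule, for illustration only: if the next token is
    unambiguously a verb, drop the verbal readings of the current token,
    provided that at least one non-verbal reading remains."""
    result = []
    for i, analyses in enumerate(analyses_per_token):
        nxt = analyses_per_token[i + 1] if i + 1 < len(analyses_per_token) else []
        next_is_verb = bool(nxt) and all(a["pos"] == "V" for a in nxt)
        non_verbal = [a for a in analyses if a["pos"] != "V"]
        if next_is_verb and non_verbal:
            result.append(non_verbal)
        else:
            result.append(analyses)
    return result


# Toy token sequence (not a real corpus sentence).
tokens = ["каре", "карыны"]
raw = [analyze(t) for t in tokens]
disambiguated = drop_verb_before_verb(raw)
```

Before the rule applies, каре has three analyses; afterwards the verbal reading is gone, but the two nominal readings remain. This mirrors the point that context rules remove ambiguity only partly.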
Udmurt belongs to the Permic group of Uralic languages. It is spoken by approximately 330,000 people who live mainly in the Russian region of Udmurtia and neighboring regions. Udmurt uses Cyrillic orthography; the alphabet consists of Russian letters and several additional letters with diacritics. All morphological markers are suffixes that attach to the stem agglutinatively. There are no traces of vowel harmony, which is believed to have existed in Proto-Uralic. Nominal grammatical categories are number, case, and possessiveness. Verbs have four morphological tenses (direct and evidential past, present and future) and index person and number of the subject. The direct object can be marked either in the nominative or in the accusative (differential object marking, DOM). The word order in the sentence is free, with SOV (subject – object – verb) being the default.
If you have questions, would like to propose collaboration, or have noticed an error in the corpus, please contact Egor Kashkin.