Tatyshly Udmurt Corpus


Welcome to the start page of the corpus of local Udmurt varieties spoken in the Tatyshly district of Bashkortostan.

Standard Udmurt corpora can be found here.

These are the main parameters of the corpus:

Language: Udmurt
Size: 38.5 thousand words (as of January 2024)
Texts:
  • sound-aligned transcriptions of recordings in Tatyshly Udmurt collected by the field team of OTIPL MSU since 2019
  • published texts in Tatyshly Udmurt
Language variety: dialectal Udmurt (transcriptions are close to the standard orthography; the phonological transcription adopted in the field project or in the source is also provided)
Annotation:
  • sentence-by-sentence Russian translations
  • automatic morphological annotation (lemmatization, part of speech, all inflectional features); 98.7% of words analyzed (only tokens that do not contain digits or Latin characters are taken into account)
  • partial rule-based disambiguation (1.5 analyses per analyzed token on average)
  • annotation of Russian loanwords
  • annotation of several lexical/semantic classes: animate/human nouns, body parts, transport, different classes of proper names
  • annotation of transitivity and (sometimes) subcategorization frames of verbs
  • glossing
  • Russian translation of lemmata
Metadata:
  • title of the text
  • year of the recording
  • subcorpus (sound-aligned field transcriptions or published texts)
  • codes of speakers
  • codes of linguists who took part in recording and transcribing
  • codes of native speakers who helped to transcribe the text
  • birth place of the speaker
  • current place of residence of the speaker
  • birth year of the speaker
  • genre

The corpus includes two kinds of Tatyshly Udmurt texts. The spoken subcorpus comprises texts recorded and transcribed in the field project of the Department of Theoretical and Applied Linguistics (OTIPL), Faculty of Philology, Lomonosov Moscow State University. The team has been working in the Tatyshly district of the Republic of Bashkortostan, Russia, primarily in the villages of Nizhnebaltachevo, Staryj Kyzyl-Jar, Ivanovka, and Novye Tatyshly. In addition, the villages of Starokalmijarovo, Verkhnebaltachevo, and Urazgildy have been visited. The team leader is Egor Kashkin. More information about the project is available on its website. The published subcorpus includes transcribed dialectal texts from earlier publications (no sound is available for them).

The transcriptions are close to the standard orthography but not identical to it. We normalize regular phonetic correspondences (e.g. the letter ы is used, whereas a reduced vowel is pronounced in this position in Tatyshly Udmurt; the letter ӵ marking a non-palatalized affricate is used, the latter corresponding to š or to a palatalized affricate č' in Tatyshly Udmurt). We do not normalize correspondences in affixes or at morpheme boundaries which cannot be explained through regular phonetic processes (e.g. the affix of the elative is written as -ись / -ӥсь, cf. -ысь in Standard Udmurt; the converb from луыны can be written as луыса or луса depending on how it was pronounced). More details on developing a transcription system for Udmurt dialect corpora are available in this paper.

In addition to the text in Cyrillic, the phonological transcription adopted in the field project is provided.

What is a corpus?

A language corpus is a collection of texts in that language which has been enriched with additional linguistic information, called annotation, and, preferably, equipped with a search engine. Here you will find a short list of frequently asked questions about the corpus.

— Who needs corpora?

First of all, corpora are used by linguists. The search engine and annotation of corpora are designed in such a way that you can make linguistic queries such as “find all nouns in the genitive case” or “find all forms of the word коӵыш followed by a verb”. Beyond linguists, a corpus can also be a useful tool for language teachers, language learners, and even native speakers.
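The two example queries above can be sketched as filters over annotated tokens. This is only a minimal illustration, not the corpus search engine: the `Token` and `Analysis` classes, the helper functions, and the toy token sequence are all hypothetical, and only the tag names (N, V, gen, and so on) come from the tagset on this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Analysis:
    lemma: str
    tags: frozenset  # tag names such as "N", "sg", "gen"


@dataclass
class Token:
    wordform: str
    analyses: list = field(default_factory=list)  # ambiguous tokens have several


def has_tags(token, *tags):
    """True if at least one analysis of the token carries all the given tags."""
    wanted = set(tags)
    return any(wanted <= a.tags for a in token.analyses)


def find_tokens(sentence, *tags):
    """Word forms in the sentence that have an analysis with all the tags."""
    return [t.wordform for t in sentence if has_tags(t, *tags)]


def find_lemma_followed_by(sentence, lemma, next_tag):
    """Adjacent pairs: a form of `lemma`, then a token tagged `next_tag`."""
    hits = []
    for cur, nxt in zip(sentence, sentence[1:]):
        if any(a.lemma == lemma for a in cur.analyses) and has_tags(nxt, next_tag):
            hits.append((cur.wordform, nxt.wordform))
    return hits


def tok(wordform, lemma, tags):
    """Build a token with a single analysis from a comma-separated tag string."""
    return Token(wordform, [Analysis(lemma, frozenset(tags.split(",")))])


# Toy token sequence standing in for a sentence; hand-annotated for
# illustration and not taken from the corpus.
sentence = [
    tok("карлэн", "кар", "N,sg,gen"),
    tok("коӵыш", "коӵыш", "N,sg,nom"),
    tok("мынэ", "мыныны", "V,I,prs,3,sg"),
]

print(find_tokens(sentence, "N", "gen"))               # nouns in the genitive
print(find_lemma_followed_by(sentence, "коӵыш", "V"))  # коӵыш followed by a verb
```

A token matches if any one of its analyses fits the query, so ambiguous tokens may be returned even when the intended reading is a different one.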

— Can I use the corpus as a library?

No, these corpora are not designed for that. When you work with a corpus, you make a query, i.e. search for a particular word, phrase or construction, and get back all sentences that contain what you searched for. By default, the sentences are shown in random order. You can expand the context of each of the sentences you get, i.e. look at their neighboring sentences. However, you may do so only a limited number of times for each sentence. Therefore, it is impossible to read an entire text in the corpus. This is done for copyright protection.

— Can I use the corpus as a dictionary?

Each Udmurt word in the corpus has a Russian translation (no English translations are available at the moment). However, these translations are only provided as auxiliary information for users who do not speak Udmurt. They are kept short and simple by design: they do not list all senses and do not provide usage examples as real dictionaries do. If you want to know how to translate a word, the right way to do so is to consult a dictionary (for an Udmurt-Russian dictionary, see e.g. here).

— What is morphological annotation and how do you get it?

The corpus located here is lemmatized and morphologically annotated. Lemmatization means that each word in the texts is annotated with its lemma, i.e. its dictionary (citation) form. Morphological annotation means that each word is annotated for its grammatical features, such as part of speech, number, case, tense, etc. Since the corpora in question are too large for manual annotation to be feasible, they were annotated automatically with a program called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Udmurt inflection. The analyzer, together with the necessary materials, is freely available on GitHub. Unfortunately, automatic annotation means that, first, out-of-vocabulary words are not annotated, and, second, some words receive several competing analyses. For example, confronted with the form каре, the analyzer cannot determine whether it should be analyzed as 1sg possessive of кар (“my city”), illative of the same word (“to the city”) or even a form of the verb карыны “do”. Ambiguity is partly removed with manually designed context rules. Russian translations were annotated with the mystem analyzer.
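The share of analyzed words and the average number of analyses per analyzed token, both reported in the parameters above, can be illustrated with a small sketch. It does not use the real analyzer: `TOY_OUTPUT` is a hypothetical stand-in for analyzer output, the exact tag strings are assumptions, and whether the corpus computes average ambiguity over exactly this token set is also an assumption.

```python
import re

# Tokens containing any digit or Latin letter are left out of the count.
EXCLUDED = re.compile(r"[0-9A-Za-z]")

# Hypothetical analyzer output: word form -> list of (lemma, tags).
# The three readings of "каре" follow the example in the text; the tag
# strings themselves are an assumption.
TOY_OUTPUT = {
    "каре": [("кар", "N,sg,nom,1sg"), ("кар", "N,sg,ill"), ("карыны", "V")],
    "кар": [("кар", "N,sg,nom")],
}


def countable(tokens):
    """Tokens that take part in the coverage statistics."""
    return [t for t in tokens if not EXCLUDED.search(t)]


def coverage(tokens, output):
    """Share of countable tokens that received at least one analysis."""
    counted = countable(tokens)
    analyzed = [t for t in counted if output.get(t)]
    return len(analyzed) / len(counted) if counted else 0.0


def mean_ambiguity(tokens, output):
    """Average number of analyses per analyzed token."""
    counts = [len(output[t]) for t in countable(tokens) if output.get(t)]
    return sum(counts) / len(counts) if counts else 0.0


# "коӵыш" is deliberately missing from the toy output to play the role of
# an out-of-vocabulary word; "2019" and "OTIPL" are excluded from the count.
tokens = ["каре", "кар", "коӵыш", "2019", "OTIPL"]

print(coverage(tokens, TOY_OUTPUT))
print(mean_ambiguity(tokens, TOY_OUTPUT))
```

In this toy sample two of the three countable tokens are analyzed, and they carry three and one analyses respectively.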

Udmurt language

Udmurt belongs to the Permic group of Uralic languages. It is spoken by approximately 330,000 people who live mainly in the Russian region of Udmurtia and neighboring regions. Udmurt uses a Cyrillic orthography; the alphabet consists of the Russian letters and several additional letters with diacritics. All morphological markers are suffixes that attach to the stem agglutinatively. There are no traces of vowel harmony, which is believed to have existed in Proto-Uralic. Nominal grammatical categories are number, case, and possessiveness. Verbs have four morphological tenses (direct and evidential past, present and future) and index the person and number of the subject. The direct object can be marked either in the nominative or in the accusative (differential object marking, DOM). The word order in the sentence is free, with SOV (subject – object – verb) being the default.

Tagset

The grammatical features of the words in the corpora are marked with short tags. Here is the full list of tags used in the corpus.

  • ADJ — adjective
  • ADJPRO — adjectival pronoun
  • ADV — adverb
  • ADVPRO — adverbial pronoun
  • CNJ — conjunction
  • IMIT — ideophone
  • INTRJ — interjection
  • N — noun
  • NUM — numeral
  • PARENTH — parenthetic word
  • PART — particle
  • PN — proper noun (subtype of nouns)
  • POST — postposition
  • PREDIC — predicative
  • PRO — pronoun
  • V — verb
  • I — 1st conjugation (not in -a)
  • II — 2nd conjugation (in -a)
  • 1 — 1st person in verbs
  • 1pl — 1pl possessive
  • 1sg — 1sg possessive
  • 2 — 2nd person in verbs
  • 2pl — 2pl possessive
  • 2sg — 2sg possessive
  • 3 — 3rd person in verbs
  • 3pl — 3pl possessive
  • 3sg — 3sg possessive
  • abbr — abbreviation
  • abl — ablative
  • acc — accusative
  • adj_nmlz — nominalized adjective
  • adv — adverbial case
  • anim — animate noun
  • app — approximative
  • attr — any attributivizer
  • attr_em — attributivizer in -em
  • attr_o — attributivizer in -o
  • attr_tem — attributivizer in -tem
  • body — body part
  • car — caritive
  • case_comp — case compounding
  • caus — causative
  • coll — collective numeral
  • comp — comparative degree
  • cvb — converb
  • cvb.gen — general converb in -sa
  • cvb,dur — converb in -onja-
  • cvb,lim — converb in -toź
  • cvb,simult — converb in -ku
  • dat — dative
  • deb — debitive (form in -ono)
  • distr — distributive numeral
  • egr — egressive
  • el — elative
  • evid — 2nd (evidential) past tense
  • famn — family name
  • fict — fictitious form in -emjaśk-
  • fut — future tense
  • gen — genitive
  • hort — hortative (verbal form in -ome-)
  • hum — human
  • ill — illative
  • imp — imperative
  • impers — impersonal verb
  • inf — infinitive
  • ins — instrumental
  • intr — intransitive verb
  • iter — iterative
  • loc — locative (inessive)
  • missp — typo
  • mult — multiplicative (-ja-)
  • neg — negative form
  • nom — nominative
  • oblin — oblinative (adjective in -eś)
  • ord — ordinal numeral
  • pass — passive
  • patrn — patronymic
  • period — form in -ski̮n
  • persn — personal (given) name
  • pl — plural
  • poss_comp — possessive in case compounding
  • prol — prolative
  • prs — present tense
  • pst — past tense
  • ptcp — participle
  • ptcp,act — active participle
  • ptcp,pass — participle in -mon
  • ptcp,res — resultative participle in -mi̮n
  • refl — reflexive pronoun
  • rel_adj — relational adjective
  • rel_n — relational name (inflected postposition)
  • rus — Russian borrowing (or borrowing through Russian)
  • rus_afx — Russian affix with Udmurt stem
  • rus_inf — infinitive of a Russian verb
  • rus_ipfv — imperfective aspect of a Russian verb
  • rus_pfv — perfective aspect of a Russian verb
  • rus_pfv_ipfv — perfective/imperfective aspect of a Russian verb
  • sg — singular
  • subj — subjunctive (conditional) mood
  • supernat — noun that denotes a supernatural being. This category is a byproduct of animacy/humanness annotation. Since it is not clear whether these cases should be classified as human, we put them in a separate box, so that the user can decide that for themselves.
  • term — terminative
  • time_meas — time measurement unit
  • topn — toponym (geographical name)
  • tr — transitive verb
  • transport — transport
  • vn — verbal noun in -on
  • with_abl — verb that can have an argument in ablative
  • with_dat — verb that can have an argument in dative
  • with_el — verb that can have an argument in elative
  • with_ill — verb that can have an argument in illative
  • with_inf — verb that can have an infinitive argument
  • with_instr — verb that can have an argument in instrumental

The tagset for the Russian-language part (Russian translations) can be found in the Russian National Corpus.

Contact


If you have questions, would like to propose collaboration, or noticed an error in the corpus, please contact Egor Kashkin.

Mistakes found in the corpus may be marked in a Google Form.