Tatyshly Udmurt Corpus

These are the main parameters of the corpus:

Language	Udmurt
Size	69.5 thousand words (as of November 2025)
Texts	- sound-aligned transcriptions of recordings in Tatyshly Udmurt collected by the field team of OTIPL MSU since 2019; - published texts in Tatyshly Udmurt
Language variety	dialectal Udmurt (transcriptions are close to the standard orthography; phonological transcription adopted in the field project or in the source is also provided)
Annotation	- sentence-by-sentence Russian translations; - automatic morphological annotation (lemmatization, part of speech, all inflectional features), 97.7% of words analyzed (only tokens that do not contain digits or Latin characters are taken into account); - partial rule-based disambiguation (1.5 analyses per analyzed token on average); - annotation of Russian loanwords; - annotation of several lexical/semantic classes: animate/human nouns, body parts, transport, different classes of proper names; - annotation of transitivity and (sometimes) subcategorization frames of verbs; - glossing; - Russian translation of lemmata
Metadata	- title of the text; - year of the recording; - subcorpus (sound-aligned field transcriptions or published texts); - codes of speakers; - codes of linguists who took part in recording and transcribing; - codes of native speakers who helped to transcribe the text; - birth place of the speaker; - current place of residence of the speaker; - birth year of the speaker; - genre
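
The two figures in the Annotation row (share of analyzed words and average number of analyses per analyzed token) can be read as simple ratios. The sketch below is only an illustration of that reading, not the script used to produce the published numbers; the function name `coverage_stats` and the input format (a list of word form / analyses pairs) are assumptions made for the example.

```python
import re

# Tokens containing digits or Latin letters are left out of the count,
# as stated in the Annotation row.
_EXCLUDED = re.compile(r"[0-9A-Za-z]")


def coverage_stats(tokens):
    """Return (share of analyzed tokens, mean analyses per analyzed token).

    `tokens` is a list of (wordform, analyses) pairs, where `analyses`
    is a (possibly empty) list of analyses. Hypothetical input format.
    """
    counted = [(w, a) for w, a in tokens if not _EXCLUDED.search(w)]
    analyzed = [a for _, a in counted if a]
    coverage = len(analyzed) / len(counted) if counted else 0.0
    ambiguity = sum(len(a) for a in analyzed) / len(analyzed) if analyzed else 0.0
    return coverage, ambiguity
```

Under this reading, a token with no analysis lowers the coverage figure but does not affect the ambiguity figure, which is averaged over analyzed tokens only.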

The corpus includes two kinds of Tatyshly Udmurt texts. The spoken subcorpus comprises texts recorded and transcribed in the field project of the Department of Theoretical and Applied Linguistics (OTIPL), Faculty of Philology, Lomonosov Moscow State University. The team has been working in the Tatyshly district of the Republic of Bashkortostan, Russia, primarily in the villages of Nizhnebaltachevo, Staryj Kyzyl-Jar, Ivanovka, and Novye Tatyshly. In addition, the villages of Starokalmijarovo, Verkhnebaltachevo, and Urazgildy have been visited. The team leader is Egor Kashkin. More information about the project is available on its website. The published subcorpus includes transcribed dialectal texts from earlier publications (no sound is available for them).

The transcriptions are close to the standard orthography but not identical to it. We normalize regular phonetic correspondences (e.g. the letter ы is used, although a reduced vowel is pronounced in this position in Tatyshly Udmurt; the letter ӵ, which marks a non-palatalized affricate, is used, although it corresponds to š or to a palatalized affricate č' in Tatyshly Udmurt). We do not normalize correspondences in affixes or at morpheme boundaries that cannot be explained through regular phonetic processes (e.g. the affix of the elative is written as -ись / -ӥсь, cf. -ысь in Standard Udmurt; the converb of луыны can be written as луыса or луса depending on how it was pronounced). More details on developing a transcription system for Udmurt dialect corpora are available in this paper.

In addition to the text in Cyrillic, the phonological transcription adopted in the field project is provided.

What is a corpus?

A language corpus is a collection of texts in that language which has been enriched with additional linguistic information, called annotation, and, preferably, equipped with a search engine. Here you will find a short list of frequently asked questions about the corpus.

— Who needs corpora?

First of all, corpora are used by linguists. The search engine and annotation of corpora are designed in such a way that you can make linguistic queries such as “find all nouns in the genitive case” or “find all forms of the word коӵыш followed by a verb”. Apart from linguists, a corpus can be a useful tool for language teachers, language learners, and even native speakers.

— Can I use the corpus as a library?

No, these corpora are not designed for that. When you work with a corpus, you make a query, i.e. search for a particular word, phrase or construction, and get back all sentences that contain what you searched for. By default, the sentences are shown in random order. You can expand the context of each sentence you get, i.e. look at its neighboring sentences. However, you may do so only a limited number of times for each sentence. Therefore, it is impossible to read an entire text in the corpus. This is done for copyright protection.

— Can I use the corpus as a dictionary?

Each Udmurt word in the corpus has a Russian translation (no English translations are available at the moment). However, these translations are only provided as auxiliary information for users who do not speak Udmurt. They are kept short and simple by design: they do not list all senses and do not provide usage examples, as real dictionaries do. If you want to know how to translate a word, the right way to do so is to consult a dictionary (for an Udmurt-Russian dictionary, look e.g. here).

— What is morphological annotation and how do you get it?

The corpus located here is lemmatized and morphologically annotated. Lemmatization means that each word in the texts is annotated with its lemma, i.e. its dictionary (citation) form. Morphological annotation means that each word is annotated for its grammatical features, such as part of speech, number, case, tense, etc. Since the corpora in question are too large for manual annotation to be feasible, they were annotated automatically with a program called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Udmurt inflection. The analyzer, together with the necessary materials, is freely available on GitHub. Unfortunately, automatic annotation means that, first, out-of-vocabulary words are not annotated and, second, some words receive several competing analyses. For example, confronted with the form каре, the analyzer cannot determine whether it should be analyzed as the 1sg possessive of кар (“my city”), the illative of the same word (“to the city”), or even a form of the verb карыны “do”. Ambiguity is partly removed with manually designed context rules. Russian translations were annotated with the mystem analyzer.
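
To make the idea of ambiguity and context rules more concrete, here is a minimal sketch using the каре example. It does not reproduce the actual analyzer or its rule format: the `Analysis` class, the helper names, and the rule itself (prefer a possessive reading after a genitive word, as in мынам каре "my city") are hypothetical and serve only as an illustration. The verbal reading is tagged simply as V, without further features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str
    tags: frozenset


def analysis(lemma, *tags):
    """Convenience constructor for the example."""
    return Analysis(lemma, frozenset(tags))


# The three competing readings of каре mentioned in the text.
KARE = [
    analysis("кар", "N", "1sg"),   # "my city"
    analysis("кар", "N", "ill"),   # "to the city"
    analysis("карыны", "V"),       # a form of the verb "do"
]

POSSESSIVE = {"1sg", "2sg", "3sg", "1pl", "2pl", "3pl"}


def prefer_possessive_after_genitive(prev_analyses, analyses):
    """Toy context rule (hypothetical, not taken from the real rule set).

    If every analysis of the previous token is genitive and the current
    token has at least one possessive reading, keep only those readings.
    Otherwise return the analyses unchanged.
    """
    if prev_analyses and all("gen" in p.tags for p in prev_analyses):
        kept = [a for a in analyses if a.tags & POSSESSIVE]
        if kept:
            return kept
    return analyses
```

A rule of this kind only removes readings when the context is unambiguous, which is why disambiguation stays partial and some tokens keep several analyses.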

Udmurt language

Udmurt belongs to the Permic group of Uralic languages. It is spoken by approximately 330,000 people, who live mainly in the Russian region of Udmurtia and neighboring regions. Udmurt uses a Cyrillic orthography; the alphabet consists of the Russian letters and several additional letters with diacritics. All morphological markers are suffixes that attach to the stem agglutinatively. There are no traces of vowel harmony, which is believed to have existed in Proto-Uralic. Nominal grammatical categories are number, case, and possession. Verbs have four morphological tenses (direct and evidential past, present and future) and index the person and number of the subject. The direct object can be marked either with the nominative or with the accusative (differential object marking, DOM). The word order in the sentence is free, with SOV (subject – object – verb) being the default.

Tagset

The grammatical features of the words in the corpora are marked with short tags. Here is the full list of tags used in the corpus.

ADJ — adjective
ADJPRO — adjectival pronoun
ADV — adverb
ADVPRO — adverbial pronoun
CNJ — conjunction
IMIT — ideophone
INTRJ — interjection
N — noun
NUM — numeral
PARENTH — parenthetic word
PART — particle
PN — proper noun (subtype of nouns)
POST — postposition
PREDIC — predicative
PRO — pronoun
V — verb
I — 1st conjugation (not in -a)
II — 2nd conjugation (in -a)
1 — 1st person in verbs
1pl — 1pl possessive
1sg — 1sg possessive
2 — 2nd person in verbs
2pl — 2pl possessive
2sg — 2sg possessive
3 — 3rd person in verbs
3pl — 3pl possessive
3sg — 3sg possessive
abbr — abbreviation
abl — ablative
acc — accusative
adj_nmlz — nominalized adjective
adv — adverbial case
anim — animate noun
app — approximative
attr — any attributivizer
attr_em — attributivizer in -em
attr_o — attributivizer in -o
attr_tem — attributivizer in -tem
body — body part
car — caritive
case_comp — case compounding
caus — causative
coll — collective numeral
comp — comparative degree
cvb — converb
cvb.gen — general converb in -sa
cvb,dur — converb in -onja-
cvb,lim — converb in -toź
cvb,simult — converb in -ku
dat — dative
deb — debitive (form in -ono)
distr — distributive numeral
egr — egressive
el — elative
evid — 2nd (evidential) past tense
famn — family name
fict — fictitious form in -emjaśk-
fut — future tense
gen — genitive
hort — hortative (verbal form in -ome-)
hum — human
ill — illative
imp — imperative
impers — impersonal verb
inf — infinitive
ins — instrumental
intr — intransitive verb
iter — iterative
loc — locative (inessive)
missp — typo
mult — multiplicative (-ja-)
neg — negative form
nom — nominative
oblin — oblinative (adjective in -eś)
ord — ordinal numeral
pass — passive
patrn — patronymic
period — form in -ski̮n
persn — personal (given) name
pl — plural
poss_comp — possessive in case compounding
prol — prolative
prs — present tense
pst — past tense
ptcp — participle
ptcp,act — active participle
ptcp,pass — participle in -mon
ptcp,res — resultative participle in -mi̮n
refl — reflexive pronoun
rel_adj — relational adjective
rel_n — relational name (inflected postposition)
rus — Russian borrowing (or borrowing through Russian)
rus_afx — Russian affix with Udmurt stem
rus_inf — infinitive of a Russian verb
rus_ipfv — imperfective aspect of a Russian verb
rus_pfv — perfective aspect of a Russian verb
rus_pfv_ipfv — perfective/imperfective aspect of a Russian verb
sg — singular
subj — subjunctive (conditional) mood
supernat — noun that denotes a supernatural being. (This category is a byproduct of animacy/humanness annotation. Since it is not clear whether these cases should be classified as human, we put them in a separate box, so that users can decide for themselves.)
term — terminative
time_meas — time measurement unit
topn — toponym (geographical name)
tr — transitive verb
transport — transport
vn — verbal noun in -on
with_abl — verb that can have an argument in ablative
with_dat — verb that can have an argument in dative
with_el — verb that can have an argument in elative
with_ill — verb that can have an argument in illative
with_inf — verb that can have an infinitive argument
with_instr — verb that can have an argument in instrumental
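
When working with analyses outside the search interface, it can be convenient to sort the tags of a word into categories. The sketch below assumes that an analysis is available as a comma-separated tag string such as `N,pl,ill,1sg`; that format, the function name `group_tags`, and the grouping into categories are assumptions made for the example, while the tags themselves are taken from the list above. Note that tags are case-sensitive: ADV is the adverb, whereas adv is the adverbial case.

```python
# Subsets of the tagset above, grouped by category for this example only.
POS = {"ADJ", "ADJPRO", "ADV", "ADVPRO", "CNJ", "IMIT", "INTRJ", "N", "NUM",
       "PARENTH", "PART", "PN", "POST", "PREDIC", "PRO", "V"}
CASE = {"abl", "acc", "adv", "app", "car", "dat", "egr", "el", "gen", "ill",
        "ins", "loc", "nom", "prol", "term"}
NUMBER = {"sg", "pl"}
POSSESSIVE = {"1sg", "2sg", "3sg", "1pl", "2pl", "3pl"}


def group_tags(tag_string):
    """Split a comma-separated tag string and sort the tags into categories.

    Tags that belong to none of the categories above go to "other".
    """
    groups = {"pos": [], "case": [], "number": [], "possessive": [], "other": []}
    for tag in (t.strip() for t in tag_string.split(",")):
        if not tag:
            continue
        if tag in POS:
            groups["pos"].append(tag)
        elif tag in CASE:
            groups["case"].append(tag)
        elif tag in NUMBER:
            groups["number"].append(tag)
        elif tag in POSSESSIVE:
            groups["possessive"].append(tag)
        else:
            groups["other"].append(tag)
    return groups
```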

The tagset for the Russian-language part (Russian translations) can be found in the Russian National Corpus.
