Udmurt corpora


Welcome to the start page of the Udmurt language corpora: the Main corpus (press, blogs, and some fiction and non-fiction in literary Udmurt), the Corpus of Udmurt-language social media, and the Sound-aligned corpus of Udmurt dialects.


This is the main page of the website that hosts linguistic corpora of the Udmurt language. Currently, three corpora are available: the corpus of contemporary written literary Udmurt (“the Main corpus”), the corpus of Udmurt-language social media, and the Sound-aligned corpus of Udmurt dialects. They differ in the kinds of texts they contain, but have mostly identical annotation and search capabilities. Here is a brief comparison:

Language:
  • Main corpus: Udmurt
  • Social media corpus: Udmurt and Russian
  • Sound-aligned corpus: Udmurt

Size:
  • Main corpus: 9.57 million words
  • Social media corpus: 2.66 million words (the Udmurt part) and 9.83 million words (the Russian part)
  • Sound-aligned corpus: 11 thousand words (not including the interviewer)

Texts:
  • Main corpus: contemporary press (up to January 2018; 91.3%), blogs (5.1%), the New Testament, Wikipedia, fiction, non-fiction
  • Social media corpus: open posts and comments by Udmurt-speaking vkontakte users (up to February 2018)
  • Sound-aligned corpus: sound-aligned transcriptions of recordings collected by Ekaterina Georgieva in 2014

Language variety:
  • Main corpus: in most cases, standard written literary Udmurt or close to it
  • Social media corpus: language of digital communication; closer to the spoken variety, influenced by the dialects and the Russian language, contains numerous instances of code switching
  • Sound-aligned corpus: spoken/dialectal Udmurt (transcriptions are close to the standard orthography)

Annotation (the same in all three corpora, except for the share of analyzed words):
  • automatic morphological annotation (lemmatization, part of speech, all inflectional features); words analyzed: 97.2% in the Main corpus, 93.5% in the Social media corpus, 93.5% in the Sound-aligned corpus (only tokens that do not contain digits or Latin characters are taken into account)
  • partial rule-based disambiguation (1.4 analyses per analyzed token on average)
  • annotation of Russian loanwords
  • annotation of several lexical/semantic classes: animate/human nouns, body parts, transport, different classes of proper names
  • annotation of transitivity and (sometimes) subcategorization frames of verbs
  • glossing
  • Russian translation of lemmata

Metadata in the Main corpus:
  • title of the text
  • author or title of the newspaper
  • creation year (exact date in the case of newspapers)
  • genre

Metadata in the Social media corpus:
  • group name (for groups)
  • publicly available user metadata: sex (for everyone); if available, also birth year (grouped in 5-year spans), place of origin and current location (grouped by districts of Udmurtia or regions of Russia); real names and nicknames of the users are hidden
  • creation year
  • message type (post/comment)
  • language (tagged automatically, independently for each sentence)

Metadata in the Sound-aligned corpus:
  • title of the text
  • place of the recording (district)
  • author of the recording
  • year of recording
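
The two figures given for the automatic annotation (share of analyzed words and average number of analyses per analyzed token) could be computed roughly as follows. This is only a minimal sketch: the token representation (a word form paired with a list of analyses), the function name and the toy data are assumptions made for the example and do not come from the actual corpus pipeline.

```python
import re

# Tokens that contain a digit or a Latin letter are left out of the statistics.
EXCLUDED = re.compile(r"[0-9A-Za-z]")


def annotation_stats(tokens):
    """Return (share of analyzed words, mean analyses per analyzed word).

    `tokens` is a list of (wordform, analyses) pairs, where `analyses` holds
    one entry per analysis and is empty for out-of-vocabulary words.
    This representation is hypothetical.
    """
    counted = [(w, a) for w, a in tokens if not EXCLUDED.search(w)]
    analyzed = [a for _, a in counted if a]
    if not analyzed:
        return 0.0, 0.0
    coverage = len(analyzed) / len(counted)
    ambiguity = sum(len(a) for a in analyzed) / len(analyzed)
    return coverage, ambiguity


# Toy input; the analyses are placeholders, not real analyzer output.
sample = [
    ("каре", ["analysis 1", "analysis 2", "analysis 3"]),
    ("коӵыш", ["analysis 1"]),
    ("абвгд", []),       # nonsense string standing in for an unknown word
    ("2018", []),        # contains digits: not counted
    ("vkontakte", []),   # contains Latin letters: not counted
]

coverage, ambiguity = annotation_stats(sample)
print(f"analyzed: {coverage:.1%}, analyses per analyzed token: {ambiguity:.1f}")
```

Excluding tokens with digits or Latin characters keeps dates, URLs and foreign-script words from lowering the coverage figure.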

You can find more detailed information about the Udmurt social media corpus and its development in this paper. Please consider citing it if your research is based on this corpus:

Timofey Arkhangelskiy. 2019. Corpora of social media in minority Uralic languages. Proceedings of the fifth Workshop on Computational Linguistics for Uralic Languages, pages 125–140, Tartu, Estonia, January 7 - January 8, 2019.

The Sound-aligned Udmurt Corpus aims at documenting the spoken/dialectal varieties of Udmurt. This corpus is a joint project between Ekaterina Georgieva (Hungarian Academy of Sciences) and Timofey Arkhangelskiy. The significance of the corpus is threefold. Firstly, collections of spoken texts are a particularly valuable resource in the case of endangered and under-documented languages like Udmurt. This highlights the importance of making fieldwork data open and reusable for researchers and, possibly, for the native speaker community as well. Secondly, the Sound-aligned Udmurt Corpus documents the spoken/dialectal varieties of Udmurt, in contrast to previous corpus-building projects, which focused mostly on the standard variety. In this way, the corpus contributes to the language documentation and corpus-building efforts for Udmurt. Thirdly, our goal is to encourage other researchers who have done fieldwork with Udmurt native speakers to deposit their audio files and transcriptions in the corpus. We hope that our project will become a platform on which Udmurt fieldwork data can be shared.

The corpus currently consists of recordings collected by Ekaterina Georgieva during a fieldwork trip in July and August 2014. The collection points were the Alnash, Grakh and Balezino districts of the Republic of Udmurtia as well as the capital Izhevsk; additionally, data were also collected in Kukmor district (Republic of Tatarstan). The fieldwork recordings exemplify the spoken varieties of Udmurt and also show certain dialectal features characteristic of the three dialect groups represented in the corpus: the Northern, Central and Southern (both Southern and Southern Peripheral) dialects. The recordings contain mostly semi-structured interviews. The interviews cover different genres, such as narratives, informal conversations between speakers, description of customs, etc. Udmurt was the only medium during the recording sessions. In some of the recordings, only one native speaker was interviewed by the (non-native) fieldwork linguist, while in others, the informant(s) was/were interviewed with the help of another native speaker.

The recordings were transcribed and time-aligned in ELAN at the level of utterances/sentences. The transcription follows the Cyrillic orthography of standard Udmurt. Discourse and extralinguistic elements were also annotated. The morphological analysis was carried out with the same morphological analyzer that is used for standard Udmurt. More information on the text-processing steps can be found in the following paper:

Arkhangelskiy, Timofey & Ekaterina Georgieva. 2018. Sound-aligned corpus of Udmurt dialectal texts. In: Pirinen, Tommi A. (ed.), Proceedings of the 4th International Workshop for Computational Linguistics for Uralic Languages (IWCLUL 2018), 26–38. Stroudsburg (PA): Association for Computational Linguistics.

Please consider citing this paper if you use the corpus in your research.
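
For readers who want to work with time-aligned transcriptions of this kind, the sketch below shows how utterance-level alignment can be read from an ELAN file (.eaf, an XML format). The fragment embedded in the code is hand-written for illustration: the tier name and the texts are placeholders, not data from the corpus, and the function name is hypothetical. The sketch assumes that every time slot carries a TIME_VALUE in milliseconds, which is not guaranteed for arbitrary ELAN files.

```python
import xml.etree.ElementTree as ET

# Minimal hand-written EAF fragment; tier name and texts are placeholders.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="2350"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="5100"/>
  </TIME_ORDER>
  <TIER TIER_ID="speaker">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_utterances(eaf_xml):
    """Return (tier, start_ms, end_ms, text) for each time-aligned annotation."""
    root = ET.fromstring(eaf_xml)
    # Map time slot identifiers to their offsets in milliseconds.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
    }
    utterances = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            text = ann.findtext("ANNOTATION_VALUE", default="")
            utterances.append((tier.get("TIER_ID"), start, end, text))
    return utterances


for tier, start, end, text in read_utterances(EAF):
    print(tier, start, end, text)
```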

If you wish to deposit your fieldwork data to the sound-aligned corpus, please contact Timofey Arkhangelskiy.

What is a corpus?

A language corpus is a collection of texts in that language which has been enriched with additional linguistic information, called annotation, and, preferably, equipped with a search engine. Here you will find a short list of frequently asked questions about the Udmurt corpora.

— Who needs corpora?

First of all, corpora are used by linguists. The search engine and annotation of corpora are designed in such a way that you can make linguistic queries such as “find all nouns in the genitive case” or “find all forms of the word коӵыш followed by a verb”. Apart from linguists, a corpus can be a useful tool for language teachers, language learners, and even native speakers.
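
Conceptually, a query such as “find all nouns in the genitive case” asks for sentences in which some word has an analysis carrying all the requested tags (here N and gen from the tagset on this page). The sketch below illustrates that idea on toy data. The data layout, the function names and the placeholder word forms are assumptions made for the example; this is not how the corpus search engine is implemented.

```python
def matches(token, required_tags):
    """True if at least one analysis of the token carries all required tags."""
    required = set(required_tags)
    return any(required <= set(a["tags"]) for a in token["analyses"])


def search(sentences, required_tags):
    """Return the sentences that contain at least one matching token."""
    return [s for s in sentences if any(matches(t, required_tags) for t in s)]


# Toy sentences; "w1", "l1" etc. are placeholders for word forms and lemmata.
sentences = [
    [
        {"wordform": "w1", "analyses": [{"lemma": "l1", "tags": ["N", "sg", "gen"]}]},
        {"wordform": "w2", "analyses": [{"lemma": "l2", "tags": ["V", "prs", "3"]}]},
    ],
    [
        {"wordform": "w3", "analyses": [
            {"lemma": "l3", "tags": ["N", "sg", "nom"]},
            {"lemma": "l4", "tags": ["ADJ"]},
        ]},
    ],
]

hits = search(sentences, ["N", "gen"])
print(len(hits), "sentence(s) found")
```

Note that the tags must co-occur within a single analysis: a word with one analysis tagged N and another tagged gen would not count as a genitive noun.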

— Can I use the corpus as a library?

No, these corpora are not designed for that. When you work with a corpus, you make a query, i.e. search for a particular word, phrase or construction, and get back all sentences that contain what you searched for. By default, the sentences are shown in random order. You can expand the context of each of the sentences you get, i.e. look at their neighboring sentences. However, you may do so only a limited number of times for each sentence. Therefore, it is impossible to read an entire text in the corpus. This is done for copyright protection.

— Can I use the corpus as a dictionary?

Each Udmurt word in the corpus has a Russian translation (no English translations are available at the moment). However, the translations are only provided as auxiliary information for users who do not speak Udmurt. They are kept short and simple by design: they do not list all senses and do not provide usage examples, as real dictionaries do. If you want to know how to translate a word, the right way to do so is to consult a dictionary (for an Udmurt-Russian dictionary, see e.g. here).

— What is morphological annotation and how do you get it?

The corpora located here are lemmatized and morphologically annotated. Lemmatization means that each word in the texts is annotated with its lemma, i.e. dictionary/citation form. Morphological annotation means that each word is annotated for its grammatical features, such as part of speech, number, case, tense, etc. Since the corpora in question are too large for manual annotation to be feasible, they were annotated automatically with a program called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Udmurt inflection. The analyzer, together with the necessary materials, is freely available in my GitHub repository. Automatic annotation unfortunately means that, first, out-of-vocabulary words are not annotated, and, second, some words receive several alternative analyses. For example, confronted with the form каре, the analyzer cannot determine whether it should be analyzed as 1sg possessive of кар (“my city”), illative of the same word (“to the city”) or even a form of the verb карыны “do”. Ambiguity is partly removed with manually designed context rules. Russian sentences in the social media corpus were annotated with the mystem analyzer.
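
The ambiguity of каре and the effect of a context rule can be pictured with the sketch below. The three analyses follow the example above, with simplified tag sets. The rule itself (drop verbal readings when the next word can only be a postposition) is invented purely for illustration and is not claimed to be one of the rules actually used in the corpora; the function name and data layout are likewise hypothetical.

```python
# Three analyses of the form "каре", as described above (tag sets simplified).
KARE = [
    {"lemma": "кар", "tags": {"N", "1sg"}},    # "my city"
    {"lemma": "кар", "tags": {"N", "ill"}},    # "to the city"
    {"lemma": "карыны", "tags": {"V"}},        # a form of the verb "do"
]


def drop_verb_before_postposition(analyses, next_analyses):
    """Hypothetical context rule, for illustration only.

    If every analysis of the following word is a postposition (POST),
    discard the verbal readings of the current word. The rule never
    removes the last remaining analysis.
    """
    next_is_post = bool(next_analyses) and all(
        "POST" in a["tags"] for a in next_analyses
    )
    if not next_is_post:
        return analyses
    kept = [a for a in analyses if "V" not in a["tags"]]
    return kept or analyses


# The following word is a placeholder with a single postposition analysis.
following = [{"lemma": "placeholder", "tags": {"POST"}}]
remaining = drop_verb_before_postposition(KARE, following)
print(len(KARE), "analyses before,", len(remaining), "after")
```

In this toy case the rule removes the verbal reading but cannot choose between the two nominal ones, which is what partial disambiguation means: some tokens still keep more than one analysis.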

Udmurt language

Udmurt belongs to the Permic group of Uralic languages. It is spoken by approximately 330,000 people who live mainly in the Russian region of Udmurtia and neighboring regions. Udmurt uses Cyrillic orthography; the alphabet consists of the Russian letters and several additional letters with diacritics. All morphological markers are suffixes that attach to the stem agglutinatively. There are no traces of vowel harmony, which is believed to have existed in Proto-Uralic. Nominal grammatical categories are number, case, and possessiveness. Verbs have four morphological tenses (direct and evidential past, present and future) and index person and number of the subject. The direct object can be marked either in the nominative or in the accusative (differential object marking, DOM). The word order in the sentence is free, with SOV (subject – object – verb) being the default.

Tagset

The grammatical features of the words in the corpora are marked with short tags. Here is the full list of tags used in the Udmurt corpora. All three corpora use an identical set of tags.

  • ADJ — adjective
  • ADJPRO — adjectival pronoun
  • ADV — adverb
  • ADVPRO — adverbial pronoun
  • CNJ — conjunction
  • IMIT — ideophone
  • INTRJ — interjection
  • N — noun
  • NUM — numeral
  • PARENTH — parenthetic word
  • PART — particle
  • PN — proper noun (subtype of nouns)
  • POST — postposition
  • PREDIC — predicative
  • PRO — pronoun
  • V — verb
  • I — 1st conjugation (not in -a)
  • II — 2nd conjugation (in -a)
  • 1 — 1st person in verbs
  • 1pl — 1pl possessive
  • 1sg — 1sg possessive
  • 2 — 2nd person in verbs
  • 2pl — 2pl possessive
  • 2sg — 2sg possessive
  • 3 — 3rd person in verbs
  • 3pl — 3pl possessive
  • 3sg — 3sg possessive
  • abbr — abbreviation
  • abl — ablative
  • acc — accusative
  • adj_nmlz — nominalized adjective
  • adv — adverbial case
  • anim — animate noun
  • app — approximative
  • attr — any attributivizer
  • attr_em — attributivizer in -em
  • attr_o — attributivizer in -o
  • attr_tem — attributivizer in -tem
  • body — body part
  • car — caritive
  • case_comp — case compounding
  • caus — causative
  • coll — collective numeral
  • comp — comparative degree
  • cvb — converb
  • cvb.gen — general converb in -sa
  • cvb,dur — converb in -onja-
  • cvb,lim — converb in -toź
  • cvb,simult — converb in -ku
  • dat — dative
  • deb — debitive (form in -ono)
  • distr — distributive numeral
  • egr — egressive
  • el — elative
  • evid — 2nd (evidential) past tense
  • famn — family name
  • fict — fictitious form in -emjaśk-
  • fut — future tense
  • gen — genitive
  • hort — hortative (verbal form in -ome-)
  • hum — human
  • ill — illative
  • imp — imperative
  • impers — impersonal verb
  • inf — infinitive
  • ins — instrumental
  • intr — intransitive verb
  • iter — iterative
  • loc — locative (inessive)
  • missp — typo
  • mult — multiplicative (-ja-)
  • neg — negative form
  • nom — nominative
  • oblin — oblinative (adjective in -eś)
  • ord — ordinal numeral
  • pass — passive
  • patrn — patronymic
  • period — form in -ski̮n
  • persn — personal (given) name
  • pl — plural
  • poss_comp — possessive in case compounding
  • prol — prolative
  • prs — present tense
  • pst — past tense
  • ptcp — participle
  • ptcp,act — active participle
  • ptcp,pass — participle in -mon
  • ptcp,res — resultative participle in -mi̮n
  • refl — reflexive pronoun
  • rel_adj — relational adjective
  • rel_n — relational name (inflected postposition)
  • rus — Russian borrowing (or borrowing through Russian)
  • rus_afx — Russian affix with Udmurt stem
  • rus_inf — infinitive of a Russian verb
  • rus_ipfv — imperfective aspect of a Russian verb
  • rus_pfv — perfective aspect of a Russian verb
  • rus_pfv_ipfv — perfective/imperfective aspect of a Russian verb
  • sg — singular
  • subj — subjunctive (conditional) mood
  • supernat — noun that denotes a supernatural being (this category is a byproduct of animacy/humanness annotation; since it is not clear whether these cases should be classified as human, they are kept in a separate category, so that users can decide that for themselves)
  • term — terminative
  • time_meas — time measurement unit
  • topn — toponym (geographical name)
  • tr — transitive verb
  • transport — transport
  • vn — verbal noun in -on
  • with_abl — verb that can have an argument in ablative
  • with_dat — verb that can have an argument in dative
  • with_el — verb that can have an argument in elative
  • with_ill — verb that can have an argument in illative
  • with_inf — verb that can have an infinitive argument
  • with_instr — verb that can have an argument in instrumental

The tagset for the Russian-language part (Russian sentences in the social media corpus) can be found in the Russian National Corpus.

Authors

The first version of the (main) Udmurt corpus was developed by Maria Medvedeva and Timofey Arkhangelskiy in the School of Linguistics at HSE. Since the release of the first version in 2014, the written corpora and the morphological analyzer have been maintained and developed by Timofey Arkhangelskiy. The social media corpus was released in 2018 as part of his postdoctoral project supported by the Alexander von Humboldt Foundation. The sound-aligned corpus, launched in 2018, is a joint project of Ekaterina Georgieva and Timofey Arkhangelskiy. All corpora are hosted by the School of Linguistics at HSE, Moscow.

Contacts


If you have questions, would like to propose a collaboration, or have noticed an error in the corpus, please contact Timofey Arkhangelskiy. (Typos in blogs and social media are an exception: these texts are left "as is".) You can also use the Udmurt morphological analyzer and the tsakorpus corpus platform, which are open source and freely available.