SUMMARY : Session O4-S Speech Corpora and Dialogue


Title Spoken Russian in the Russian National Corpus (RNC)
Authors E. Grishina
Abstract The RNC is now a 120-million-word collection of Russian texts, which makes it the most representative and authoritative corpus of the Russian language. It is available on the Internet. The RNC contains texts of all genres and types, covering Russian from the 19th to the 21st century. Experience in building national corpora has shown that a sub-corpus of spoken language is indispensable, so the builders of the RNC intend to include about 10 million words of Spoken Russian. Oral speech in the Corpus is represented in standard Russian orthography. Although this decision rules out any phonetic study of the Spoken Russian Corpus, Spoken Russian can still be studied from any other linguistic point of view. In addition to the traditional annotations (metatextual and morphological), the Spoken Sub-corpus carries sociological annotation. Unlike ordinary oral speech, which is spontaneous and not intended to be reproduced, Multimedia Spoken Russian (MSR) is to a large extent premeditated and clearly meant to be reproduced. MSR is also to be included in the RNC: as a first step, we plan to build a very interesting and provocative part of the RNC from the textual component of about 300 Russian films.
Full paper Spoken Russian in the Russian National Corpus (RNC)