Digital Documentation of the Russian Language

Head of the Laboratory: Anastasiya A. Bonch-Osmolovskaya, PhD

The aim of the laboratory is to create a next-generation corpus platform (hereinafter Corpus 2.0), which will be the foundation of a national reference system for the Russian language. This system will be a crucial element of the scientific infrastructure in linguistics and the humanities, with the Russian National Corpus (RNC) at its core. Because corpus technologies are developing rapidly and the needs of the Russian and international users of the RNC are growing, the RNC, created 15 years ago, requires a complete modernization of its technological basis.

The laboratory was founded in 2020 with support from the Ministry of Science and Higher Education of the Russian Federation under Agreement No. 075-15-2020-793: “Next-generation computational linguistics platform for the Russian language digital recording: infrastructure, resources, research”. Note, however, that Corpus 2.0 is not being created “from scratch”: the RNC is an important constituent of it.

To meet this challenge, a consortium of five Russian institutions has been created: the A.A. Kharkevich Institute for Information Transmission Problems of RAS (IITP RAS), the V.V. Vinogradov Russian Language Institute of RAS (RLI RAS), the National Research University “Higher School of Economics” (NRU HSE), the Institute for Linguistic Studies of RAS (ILS RAS), and Voronezh State University (VSU). As IITP RAS is the umbrella organization for the consortium, our laboratory coordinates the work carried out by the four other institutions.

The work within the project is divided into three blocks: the infrastructural block is responsible for the technological basis of Corpus 2.0; the resource block provides the Corpus with new data and new annotation; and the research block focuses on testing the Corpus on specific scientific research tasks. These three blocks are subdivided into eleven working groups, each focusing on a specific task:

  1. The system architecture group is creating the hardware and software complex that will serve as a new-generation modular corpus platform.

  2. The data and annotation unification group is responsible for tools that can be applied to corpora with different annotation (e.g. the Old Russian corpus, the poetic corpus, the parallel corpus, etc.), so that these tools meet the standards of data representation for operations such as search, conversion, download and storage.

  3. The statistics and visualization group is creating a tool for detailed statistical analysis and visualization of search results.

  4. The balance group is responsible for a significant increase in the corpus size. This means, firstly, enlarging the main subcorpus of standard Russian (at least 20M tokens for texts of the 18th, 19th and 21st centuries, and at least 120M tokens for the press subcorpus) and, secondly, enlarging the subcorpus of the Russian language used on the Web (at least 50M tokens).

  5. The cross-cutting diachronic search group is creating a system that will allow searching across different subcorpora of the Russian language, especially corpora of different historical periods (Old Russian, Middle Russian and Modern Russian).

  6. The specialized subcorpora group is subdivided into several teams, which manage the corpora with specialized annotation within Corpus 2.0. The aim of the teams is to expand these corpora, namely the poetic subcorpus, the historical and parallel subcorpora, and the syntactically annotated subcorpus SynTagRus.

  7. The Children’s corpus group is creating a new annotated subcorpus of modern children’s literature.

  8. The dataset group aims to preprocess and publish open-source annotated collections of Russian texts for machine learning and neural network training on Russian-language data and for AI applications.

  9. The experimental research group conducts research based on Corpus 2.0 and combines it with psycho- and neurolinguistic experiments in related fields, namely Russian grammar and lexicon.

  10. The Russian corpus grammar group (RusGram) aims to complete the work on a holistic corpus-based theoretical description of the Russian language. The main fields of current research are grammar, syntax and morphophonology. The team plans to publish the results as a printed edition of the new Russian grammar.

  11. The Russian Constructicon and Micro-syntax group consists of two teams. The first works on the Russian Constructicon and aims to update this database; the second focuses on research into constructions within the framework of Micro-syntax.

The results of Groups 1-3 belong to the infrastructural block, the results of Groups 4-8 to the resource block, and the results of Groups 9-11 to the research block.