MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora


Tomaž Erjavec

Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 31, Ljubljana, Slovenia




The paper presents the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes EAGLES-based morphosyntactic specifications, which define the features that describe word-level syntactic annotations; medium-scale morphosyntactic lexica; and annotated parallel, comparable, and speech corpora. The most important component is the linguistically annotated corpus consisting of Orwell's novel "1984" in the English original and its translations. The resources are the result of several EU projects: MULTEXT-East (which produced linked resources for Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian and English), TELRI (which added resources for Lithuanian, Croatian, Serbian, and Russian; first release), and CONCEDE (validation, re-encoding; partial re-release). The third release brings together the first two, makes them available in TEI P4 XML, and introduces further extensions, e.g. the specification for Resian, a dialect of Slovene. This dataset, unique in terms of its languages and the wealth of its encoding, is extensively documented and freely available for research purposes. The paper describes the component resources, reviews some research undertaken on the basis of the first two editions, and discusses future plans.


Central and Eastern European languages, multilingual corpora, lexical resources, morphosyntactic specifications, part-of-speech tagging

Language(s): Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Latvian, Lithuanian, Resian, Romanian, Russian, Serbian, Slovene
Full Paper