The feasibility of a complete text corpus


Primož Jakopin (Slovenian Corpus Laboratory, Fran Ramovš Institute of Slovenian Language ZRC SAZU, Novi trg 4, 1000 Ljubljana, Slovenia)


WP1: Corpora & Corpus Tools


This paper estimates the annual increase in size of a complete text corpus of a single language, Slovenian. Such a corpus comprises serial publications in Slovenian, monographs, and pages published on the Internet. The estimate for the year 2000, based on 21,000 units of serial publications, 675,000 pages from 5,200 units of printed monographs, 377,000 pages from 5,500 units of unpublished monographs (mostly academic theses) and 300,000 pages on the Internet, is less than 1.5 billion words. An extension of the law on legal deposit, which would also cover electronic versions of printed texts, is proposed. It is suggested that, to make the idea of a complete corpus viable, it should be simple and profitable for publishers to supply web versions of their publications alongside the printed ones.
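The page-based part of this estimate can be illustrated with a short sketch. The unit and page counts below are the ones quoted in the abstract; the function names and the words_per_page parameter are hypothetical, introduced only for illustration, and the abstract does not state the conversion factor actually used. Serial publications are left out because only their unit count (21,000) is given here.

```python
# Figures for the year 2000 as quoted in the abstract: (units, pages).
# The Internet figure is given in pages only, so its unit count is None.
PAGE_COUNTS = {
    "printed_monographs": (5_200, 675_000),
    "unpublished_monographs": (5_500, 377_000),
    "internet": (None, 300_000),
}


def total_pages(counts=PAGE_COUNTS):
    """Sum the page counts over all page-based categories."""
    return sum(pages for _, pages in counts.values())


def pages_per_unit(category, counts=PAGE_COUNTS):
    """Average number of pages per unit for a category with a known unit count."""
    units, pages = counts[category]
    if units is None:
        raise ValueError(f"no unit count available for {category!r}")
    return pages / units


def estimate_words(words_per_page, counts=PAGE_COUNTS):
    """Convert pages to words.

    words_per_page is an assumed conversion factor supplied by the caller;
    it is NOT a value taken from the paper.
    """
    return total_pages(counts) * words_per_page


if __name__ == "__main__":
    print("total pages:", total_pages())
    print("pages per printed monograph:", round(pages_per_unit("printed_monographs"), 1))
    # 300 words per page is a placeholder chosen only to show the call.
    print("words at 300 per page:", estimate_words(300))
```

Under any such assumed factor, the page-based categories give only a partial figure; the serial publications would have to be added to it to approach the overall bound of less than 1.5 billion words.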


Text corpus

Full Paper