Creating open language resources for Hungarian


Péter Halácsy (1), András Kornai (2), László Németh (1), András Rung (1), István Szakadát (1), Viktor Trón (3)

(1) Budapest Institute of Technology Media Research and Education Center, mail: {halacsy,nemeth,rung,szakadat}@mokk.bme.hu; (2) MetaCarta Inc., mail: andras@kornai.com; (3) International Graduate College, Saarland University and University of Edinburgh, mail: v.tron@ed.ac.uk




With Hungary's accession to the EU, wider availability of Hungarian language resources (LRs) is becoming more critical. Various Hungarian LRs and language technology tools (LTs) exist, but most are proprietary products: the companies and research labs developing them are often reluctant to make them available even for research, let alone for commercial purposes. The SzoSzablya 'WordSword' project at the Centre of Media Research and Education of Budapest University of Technology and Economics started in March 2003 with the express goal of addressing this problem by developing a comprehensive set of LRs and an LT toolkit, made publicly available under an unrestrictive LGPL-style license. This paper reports on our progress. We describe the process of creating a gigaword corpus by crawling the Hungarian web and collecting 18 million webpages from the .hu domain. We discuss the methods used for cleaning the data and document how an approximate frequency dictionary was compiled from the corpus.


gigaword corpus, web, open source LRs, frequency dictionary

Language(s) Hungarian
