Summary of the paper

Title Targeting Chinese Nominal Compounds in Corpora
Authors Weiruo Qu, Christoph Ringlstetter and Randy Goebel
Abstract For compounding languages, a great part of the topical semantics is transported via nominal compounds. Various applications of natural language processing can profit from explicit access to these compounds, provided by a lexicon. The best way to acquire such a resource is to harvest corpora that represent the domain in question. For Chinese, a significant difficulty lies in the fact that the text comes as a string of characters, only segmented by sentence boundaries. Extraction algorithms that solely rely on context variety do not perform precisely enough. We propose a pipeline of filters that starts from a candidate set established by accessor variety and then employs several methods to improve precision. For the experiments the Xinhua part of the Chinese Gigaword Corpus was used. We extracted a random sample of 200 story texts with 119,509 Hanzi characters. All compound words of this evaluation corpus were tagged, segmented into their morphemes, and augmented with the POS-information of their segments. A cascade of filters applied to a preliminary set of compound candidates led to a very high precision of over 90%, measured for the types. The result also holds for a small corpus where a solely contextual method introduces too much noise, even for the longer compounds. An introduction of MI into the basic candidacy algorithm led to a much higher recall with still reasonable precision for subsequent manual processing. Especially for the four-character compounds, that in our sample represent over 40% of the target data, the method has sufficient efficacy to support the rapid construction of compound dictionaries from domain corpora.
Topics MultiWord Expressions & Collocations, Statistical methods, Corpus (creation, annotation, etc.)
Full paper Targeting Chinese Nominal Compounds in Corpora
Slides -
Bibtex @InProceedings{QU08.662,
  author = {Weiruo Qu, Christoph Ringlstetter and Randy Goebel},
  title = {Targeting Chinese Nominal Compounds in Corpora},
  booktitle = {Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)},
  year = {2008},
  month = {may},
  date = {28-30},
  address = {Marrakech, Morocco},
  editor = {Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-4-0},
  note = {},
  language = {english}

Powered by ELDA © 2008 ELDA/ELRA