Expanding lexicons by inducing paradigms and validating attested forms
Gregory Grefenstette (Clairvoyance Corporation, 5001 Baum Blvd, Pittsburgh, PA 15213, USA)
Yan Qu (Clairvoyance Corporation, 5001 Baum Blvd, Pittsburgh, PA 15213, USA)
David A. Evans (Clairvoyance Corporation, 5001 Baum Blvd, Pittsburgh, PA 15213, USA)
WO2: Acquisition Of Lexical Information
A major bottleneck in Natural Language Processing (NLP) for any given language is building a lexicon with adequate coverage of that language. A morphological lexicon provides two important pieces of information for NLP applications: 1) the normalization (lemmatization) of a word, which allows an application to recognize variants of the same word; and 2) the parts of speech that the word can take, which allows the application to parse text and establish relations between its words. Many NLP applications, such as information retrieval, classification, and terminology extraction, depend on the normalization and parsing information found in lexicons. When a word is missing from these lexicons, it is difficult to predict its proper lemmatization and parts of speech. In this paper we present a technique for updating a lexicon when an unknown word is encountered: paradigms are induced from an existing, but incomplete, lexicon, and the paradigm proposed for the unknown word is validated using corpus evidence.
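To make the idea of inducing paradigms and validating them against attested forms concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the procedure used in this paper: the lexicon format (lemma mapped to its set of forms), the suffix-set representation of a paradigm, the scoring by fraction of attested forms, and all names (`induce_paradigms`, `propose`, `validate`) and toy data are hypothetical.

```python
import os
from collections import Counter


def induce_paradigms(lexicon):
    """Count how many lemmas follow each suffix set (an assumed notion of 'paradigm').

    `lexicon` maps a lemma to the set of its inflected forms (hypothetical format).
    """
    paradigms = Counter()
    for lemma, forms in lexicon.items():
        all_forms = set(forms) | {lemma}
        # The stem is taken to be the longest prefix shared by every form.
        stem = os.path.commonprefix(sorted(all_forms))
        suffixes = frozenset(form[len(stem):] for form in all_forms)
        paradigms[suffixes] += 1
    return paradigms


def propose(word, paradigms):
    """Yield (stem, suffix set) pairs under which `word` could be one inflected form."""
    for suffixes in paradigms:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                yield word[:len(word) - len(suffix)], suffixes


def validate(word, paradigms, corpus_counts):
    """Rank candidate analyses by the fraction of predicted forms attested in the corpus.

    Ties are broken by how many known lemmas support the paradigm, then by stem.
    """
    ranked = []
    for stem, suffixes in propose(word, paradigms):
        predicted = {stem + suffix for suffix in suffixes}
        attested = {form for form in predicted if corpus_counts.get(form, 0) > 0}
        score = len(attested) / len(predicted)
        ranked.append((score, paradigms[suffixes], stem, sorted(predicted)))
    ranked.sort(key=lambda c: (-c[0], -c[1], c[2]))
    return ranked


# Toy data, invented purely for illustration.
lexicon = {
    "walk": {"walk", "walks", "walked", "walking"},
    "talk": {"talk", "talks", "talked", "talking"},
    "bake": {"bake", "bakes", "baked", "baking"},
}
corpus_counts = {"email": 10, "emails": 5, "emailed": 3, "emailing": 2}

paradigms = induce_paradigms(lexicon)
ranked = validate("emailed", paradigms, corpus_counts)
best_score, best_support, best_stem, best_forms = ranked[0]
```

On this toy data, the unknown word "emailed" is analyzed best under the paradigm shared by "walk" and "talk", because all four forms it predicts occur in the corpus counts. The competing paradigm induced from "bake" predicts forms such as "emaile" that are not attested, so it ranks lower. A real system would need a richer paradigm representation and a more careful validation threshold than this sketch assumes.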