|LREC 2000 2nd International Conference on Language Resources & Evaluation
|Extraction of Unknown Words Using the Probability of Accepting the Kanji Character Sequence as One Word
|Shinnou Hiroyuki (Ibaraki University Dept. of Systems Engineering, 4-12-1 Nakanarusawa, Hitachi, Ibaraki, 216-8511, Japan, firstname.lastname@example.org)
Ikeya Masanori (Ibaraki University Dept. of Systems Engineering, 4-12-1 Nakanarusawa, Hitachi, Ibaraki, 216-8511, Japan, email@example.com)
|Session WO9 - Applications in the Written Area
|In this paper, we propose a method for extracting unknown words composed of two or three kanji characters from Japanese text. In general, an unknown word composed of kanji characters is segmented into other words by morphological analysis, and the appearance probability of each resulting segment is small. Using these features, we define a measure of how acceptable a two- or three-character kanji sequence is as a single unknown word. We also identify several segmentation patterns that are characteristic of unknown words. By applying our measure to kanji character sequences that match these patterns, we can extract unknown words. In our experiment, the F-measure for extracting unknown words composed of two and three kanji characters was about 0.7 and 0.4, respectively. Our method does not rely on the frequency of a word in the training corpus to judge whether it is an unknown word. It therefore has the advantage that low-frequency unknown words can also be extracted.