Title Grapheme-level Awareness in Word Embeddings for Morphologically Rich Languages
Authors Suzi Park and Hyopil Shin
Abstract Learning word vectors at the character level is an effective way to improve word embeddings for morphologically rich languages. However, most of these techniques have been applied to inflectional languages written in the Roman alphabet. In this paper, we investigate languages that are agglutinative and represented by non-alphabetic scripts, choosing Korean as a case study. We present a grapheme-level coding procedure for neural word embedding that utilizes word-internal features composed of syllable characters (Character CNN). Our grapheme-level model represents functional and semantic similarities, groups allomorphs, and disambiguates homographs better than syllable-level and word-level models, which underscores the importance of knowledge of morphological typology and the diversity of writing systems.
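To make the idea of grapheme-level coding concrete, the sketch below splits precomposed Hangul syllable characters into their component letters (jamo) using the standard Unicode arithmetic decomposition. It is only an illustration of what "below the syllable" means for Korean: the abstract does not specify the paper's actual coding procedure, and the function name `to_graphemes` and the choice of conjoining jamo as the output alphabet are assumptions made here.

```python
import unicodedata

# Constants from the Unicode Hangul syllable decomposition algorithm.
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first leading consonant (choseong)
V_BASE = 0x1161   # first vowel (jungseong)
T_BASE = 0x11A7   # one below the first trailing consonant (jongseong)
T_COUNT = 28      # trailing slots, including "no trailing consonant"
N_COUNT = 21 * T_COUNT   # vowel x trailing combinations per leading consonant
S_COUNT = 19 * N_COUNT   # total number of precomposed syllables


def to_graphemes(word):
    """Hypothetical helper: split each Hangul syllable into its jamo.

    Characters outside the precomposed syllable block pass through unchanged.
    """
    out = []
    for ch in word:
        idx = ord(ch) - S_BASE
        if 0 <= idx < S_COUNT:
            out.append(chr(L_BASE + idx // N_COUNT))
            out.append(chr(V_BASE + (idx % N_COUNT) // T_COUNT))
            tail = idx % T_COUNT
            if tail:  # tail == 0 means the syllable has no final consonant
                out.append(chr(T_BASE + tail))
        else:
            out.append(ch)
    return out


word = "한국어"
graphemes = to_graphemes(word)
# Cross-check against the canonical decomposition shipped with Python.
assert "".join(graphemes) == unicodedata.normalize("NFD", word)
print(len(word), "syllables ->", len(graphemes), "graphemes")
```

A sequence produced this way could be fed to a character-level model in place of syllable characters; whether the paper uses this exact inventory is not stated here.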
Topics Typological Databases, Language Modelling, Morphology
