Grouping Synonymous Sentences from a Parallel Corpus
ATR Spoken Language Translation Research Laboratories
Recently, natural language processing researches have focused on data or processing techniques for paraphrasing. Unfortunately, however, we have little data for paraphrasing. There are some research reports on collecting synonymous expressions with parallel corpus, though no suitable corpus for collecting a set of paraphrases is yet available. Therefore, we obtain a few variations of expression in paraphrase sets when we tried to apply this method with a parallel corpus. In this paper, we propose a grouping method based on the basic idea of grouping synonymous sentences related to the translation recursively, and decompose incorrect groups using the DM-decomposition algorithm. The incorrect groups include expressions that cannot be paraphrased because some words or expressions have different meanings in different situations. We discuss our method and experimental results with respect to BTEC, which is a multilingual parallel corpus.
Synonymous sentence group, parallel corpus, bipartite graph, DM decomposition, paraphrase