Terminal Device Oriented Comparable Corpora and its Alignment -- Towards Extracting Paraphrasing Patterns --
Hiroshi Nakagawa (1), Hidetaka Masuda (2), Dai Sato (2)
(1) Information Technology Center, The University of Tokyo; (2) Tokyo Denki University
Many terminal devices for mobile environments, such as mobile phones, have small, low-resolution screens compared with the large, high-resolution screens of personal computers. As a result, Web pages for ordinary personal computers and for mobile phones are developed separately, even when they are written in the same language and describe the same topic or content. In this research, we collected, over more than two years, Web news articles intended for display on personal computer screens and news articles intended for mobile terminals. We then aligned these two kinds of news articles, first at the article level and then at the sentence level, obtaining more than 88,000 pairs of aligned sentences. Next, we extracted paraphrases of the sentence-final part from this aligned corpus. Specifically, the results pair the sentence-final nouns of mobile article sentences with their counterpart expressions in Web article sentences. We extract the character strings that serve as paraphrases based on branching factor, frequency, and string length. For the 10 most frequently used nouns, the precision is 90% for the highest-ranked candidate and 80% for the top four candidates.
Paraphrase, Web, news, mobile terminal
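
The abstract names three signals for selecting paraphrase strings (branching factor, frequency, and string length) but does not give the scoring formula. The following is a minimal sketch of how such statistics might be gathered for one sentence-final noun, not the authors' implementation. The function names `candidate_stats` and `rank`, the `END` marker, the `max_extra` limit, and the product score are all assumptions made for illustration. Here "branching factor" is taken to mean the number of distinct characters observed immediately after a candidate string.

```python
from collections import Counter, defaultdict

END = "<EOS>"  # hypothetical marker for "nothing follows" (sentence end)


def candidate_stats(noun, web_sentences, max_extra=6):
    """Collect (frequency, branching factor, length) for candidate strings.

    A candidate is the noun plus 1..max_extra following characters, taken
    from the last occurrence of the noun in each Web sentence. Sentences
    that do not contain the noun are skipped.
    """
    freq = Counter()
    followers = defaultdict(set)
    for sent in web_sentences:
        start = sent.rfind(noun)  # occurrence closest to the sentence end
        if start < 0:
            continue
        for extra in range(1, max_extra + 1):
            end = start + len(noun) + extra
            if end > len(sent):
                break
            cand = sent[start:end]
            freq[cand] += 1
            # Record which character (or END) comes right after the candidate.
            followers[cand].add(sent[end] if end < len(sent) else END)
    return {c: (freq[c], len(followers[c]), len(c)) for c in freq}


def rank(stats):
    """Order candidates by an assumed score: frequency * branching * length.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    def score(c):
        f, b, n = stats[c]
        return f * b * n
    return sorted(stats, key=lambda c: (-score(c), c))
```

For example, calling `candidate_stats("発表", ["新製品を発表した", "結果を発表した", "計画を発表する"])` and passing the result to `rank` places the string shared by two of the three toy sentences ahead of the one seen only once. Any real weighting of the three signals would need to be taken from the paper body.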