Producing a Large-scale Encyclopedic Corpus over the Web


Atsushi Fujii (University of Library and Information Science) 

Katunobu Itou (National Institute of Advanced Industrial Science and Technology)

Tetsuya Ishikawa (University of Library and Information Science)


WP4: Corpus Annotation


Encyclopedias, which describe general and technical terms, are valuable language resources (LRs). Like other types of LRs that rely on human introspection and supervision, encyclopedias are expensive to construct. To address this problem, we automatically produced a large-scale encyclopedic corpus from the World Wide Web. We first searched the Web for pages containing the term in question. We then used linguistic patterns and HTML structures to extract text fragments describing the term. Finally, we organized the extracted term descriptions by domain. The resulting corpus contains approximately 100,000 terms. We also evaluated the descriptions extracted for 2,000 test terms and found that correct descriptions were obtained for 65% of them.
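As a rough illustration of the extraction step, the following Python sketch combines one HTML structure with one linguistic pattern. It is not the authors' implementation: the abstract does not list the actual patterns or structures, so the use of `<dt>`/`<dd>` definition lists, the "X is/are/refers to/means ..." pattern, and all function and class names here are assumptions made for illustration only.

```python
import re
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect (term, description) pairs from <dt>/<dd> elements,
    plus all visible text for later pattern matching.
    Hypothetical helper; the paper's HTML structures are not specified."""

    def __init__(self):
        super().__init__()
        self.pairs = []   # (term, description) pairs from definition lists
        self.text = []    # every text node, in document order
        self._tag = None  # "dt", "dd", or None
        self._term = ""
        self._desc = ""

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._tag, self._term = "dt", ""
        elif tag == "dd":
            self._tag, self._desc = "dd", ""

    def handle_endtag(self, tag):
        if tag == "dd" and self._term.strip():
            self.pairs.append((self._term.strip(), self._desc.strip()))
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        self.text.append(data)
        if self._tag == "dt":
            self._term += data
        elif self._tag == "dd":
            self._desc += data


def extract_by_pattern(term, text):
    """Return sentences of the assumed form '<term> is/are/refers to/means ...'."""
    pattern = re.compile(
        r"\b" + re.escape(term) + r"\s+(?:is|are|refers to|means)\s+[^.]*\.",
        re.IGNORECASE,
    )
    return [m.group(0).strip() for m in pattern.finditer(text)]


def extract_descriptions(term, html):
    """Gather candidate descriptions of `term` from one page."""
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    # HTML-structure candidates first, then linguistic-pattern candidates.
    found = [d for t, d in parser.pairs if t.lower() == term.lower()]
    found += extract_by_pattern(term, " ".join(parser.text))
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(found))


page = (
    "<html><body>"
    "<p>A thesaurus is a list of related words. See below.</p>"
    "<dl><dt>Thesaurus</dt>"
    "<dd>A reference work grouping words by meaning.</dd></dl>"
    "</body></html>"
)
for description in extract_descriptions("thesaurus", page):
    print(description)
```

In a full pipeline along the lines described above, the pages passed to such a function would come from a Web search for the term, and the collected descriptions would then be grouped by domain; both of those steps are outside this sketch.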


Encyclopedias, Corpus building, World Wide Web, Information retrieval, Information organization
