Title

The Reuters Corpus Volume 1 - from Yesterday's News to Tomorrow's Language Resources

Authors

Tony Rose (Technology Innovation Group, Reuters Limited, 85 Fleet Street, London EC4P 4AJ)

Mark Stevenson (Technology Innovation Group, Reuters Limited, 85 Fleet Street, London EC4P 4AJ)

Miles Whitehead (Technology Innovation Group, Reuters Limited, 85 Fleet Street, London EC4P 4AJ)

Session

WO8: Written Corpora

Abstract

Reuters, the global information, news and technology group, has for the first time made large quantities of archived Reuters news stories available free of charge for use by research communities around the world. The Reuters Corpus Volume 1 (RCV1) includes over 800,000 news stories, typical of the annual English-language news output of Reuters. This paper describes the origins of RCV1, the motivations behind its creation, and how it differs from previous corpora. In addition, we discuss the system of category coding, whereby each story is annotated for topic, region and industry sector. We also discuss the process by which these codes were applied, and examine the issues involved in maintaining quality and consistency of coding in an operational, commercial environment.

Keywords

Corpus, Annotation, Coding, Consistency, News, Metadata

Full Paper

80.pdf