The American National Corpus First Release


Nancy Ide, Keith Suderman

Department of Computer Science Vassar College Poughkeepsie, NY 12604-0520 USA




The First Release of the American National Corpus (ANC) was made available in mid-fall, 2003. The data includes approximately 11 million words of American English, including written and spoken data and a variety of text types annotated for part of speech and lemma. The corpus is provided in XML format conformant to the XML Corpus Encoding Standard (XCES) (http://www.xml-ces.org), and is distributed in both a stand-off version (where annotation is in an XML document separate from the primary texts) and a merged version (where annotation is included in-line in the texts). The merged version includes annotation for part of speech and lemma produced by the Biber tagger; in stand-off annotation, in addition to the Biber tagging, morpho-syntactic annotations of the data are provided using the CLAWS 5 and 7 tagsets as well as several other tagsets.


corpus, American English, corpus annotation

Language(s) American English
Full Paper