The American National Corpus First Release
Nancy Ide, Keith Suderman
Department of Computer Science Vassar College Poughkeepsie, NY 12604-0520 USA
The First Release of the American National Corpus (ANC) was made available in mid-fall, 2003. The data includes approximately 11 million words of American English, including written and spoken data and a variety of text types annotated for part of speech and lemma. The corpus is provided in XML format conformant to the XML Corpus Encoding Standard (XCES) (http://www.xml-ces.org), and is distributed in both a stand-off version (where annotation is in an XML document separate from the primary texts) and a merged version (where annotation is included in-line in the texts). The merged version includes annotation for part of speech and lemma produced by the Biber tagger; in stand-off annotation, in addition to the Biber tagging, morpho-syntactic annotations of the data are provided using the CLAWS 5 and 7 tagsets as well as several other tagsets.
corpus, American English, corpus annotation