Title

An Efficient and Flexible Format for Linguistic and Semantic Annotation

Authors

Špela Vintar (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany)

Paul Buitelaar (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany)

Bärbel Ripplinger (Eurospider Information Technology AG, Schaffhauserstrasse 18, CH-8006 Zürich, Switzerland)

Bogdan Sacaleanu (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany)

Diana Raileanu (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany)

Detlef Prescher (DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany)

Session

WP4: Corpus Annotation

Abstract

This paper describes an XML annotation format and tool developed within the MUCHMORE project. The annotation scheme was designed specifically for Cross-Lingual Information Retrieval in the medical domain, so as to allow both efficient and flexible access to the individual layers of information. We use a parallel English-German corpus of medical abstracts and annotate it with linguistic information (tokenisation, part-of-speech tagging, lemmatisation and decomposition, phrase recognition, grammatical functions) as well as semantic information from various sources. The annotation of medical terms and concepts, semantic types and semantic relations is based on the Unified Medical Language System (UMLS). Additionally, we use EuroWordNet as a general-language resource to annotate word senses and to compare domain-specific with general language use. A further major aim of the project is to complement existing ontological resources by extracting new terms and new semantic relations. We present the annotation scheme, which is conceptually related to stand-off annotation, and describe our tool for automatic semantic annotation.

Keywords

Flexible format, Tools

Full Paper

167.pdf