LREC WORKSHOP

Data Architectures and Software Support for Large Corpora

Towards an American National Corpus

Athens, Greece
30 May 2000

This workshop has been merged with the EAGLES/ISLE Workshop on Meta-Descriptions and Annotation Schemas for Multimodal/Multimedia Language Resources.
A full program and description of the workshops and information for authors can be found HERE.

Description

Over the past several years, the natural language processing community has developed a number of software systems for linguistic annotation, search, and retrieval of large corpora, including LT-XML (Edinburgh), GATE (Sheffield), IMS Corpus Workbench (Stuttgart), Alembic Workbench (MITRE), MATE (Edinburgh/Odense/Stuttgart), Silfide (Loria/CNRS), and SARA (BNC), among others. Related to and in support of this development, there have also been efforts to develop standards for encoding and for various kinds of linguistic annotation, as well as data architectures (e.g., TIPSTER, TalkBank). Still other developments, such as the introduction of XML and the powerful XSL transformation language and work on semi-structured data (e.g., the work of the Lore group at Stanford), have also affected the ways in which corpora and other linguistic resources can be represented, stored, and accessed.

Current systems for the annotation and exploitation of linguistic corpora vary widely in the fundamental design of their formats, data, and tools. A primary reason for this diversity is that most developers are concerned with only one aspect of the creation/annotation/exploitation process. To work effectively toward commonality, however, the phases of the process must be considered as a whole. This demands bringing together researchers and developers from a variety of domains (text, speech, video, etc.), many of whom have previously had little or no contact.

This workshop is intended to bring these groups together to look broadly at the technical issues that bear on the development of software systems for the annotation and exploitation of linguistic resources. The goal is to lay the groundwork for the definition of a data and system architecture to support corpus annotation and exploitation that can be widely adopted within the community. Among the issues to be addressed are:

layered data architectures
system architectures for distributed databases
support for plurality of annotation schemes
impact and use of XML/XSL
support for multimedia, including speech and video
tools for creation, annotation, query and access of corpora
mechanisms for linkage of annotation and primary data
applicability of semi-structured data models, search and query systems, etc.
evaluation/validation of systems and annotations

The motivation for this workshop is the American National Corpus (ANC) effort, which should begin corpus creation within the year. We anticipate that the ANC will provide a significant resource for natural language processing, and we therefore seek to identify state-of-the-art methods for its creation, annotation, and exploitation. Moreover, because the ANC will be a national and freely available resource, its data and system architecture is likely to become a de facto standard. We therefore hope to draw together leading researchers and developers to establish a basis for the design of a system to support the creation and use of the ANC.

A "Birds of a Feather" session for those interested in the ANC project will be held immediately following the workshop.

Contact: Nancy Ide
Department of Computer Science
Vassar College
Poughkeepsie, New York 12604-0520 USA
Tel: +1 914 437 5988
Fax: +1 914 437 7498
ide@vassar.edu

Last modified 18 April 2000.
