Information Extraction meets Corpus Linguistics
Pre-Conference Workshop

Held in conjunction with the
Second International Conference on Language Resources and Evaluation
(LREC 2000)
Athens, Greece

Tuesday, 30th May 2000

Workshop description

This workshop seeks to explore how Information Extraction and Corpus Linguistics can each benefit from the techniques of the other.

The goals of information extraction and of corpus linguistics have thus far had little in common. However, both are concerned with processing large bodies of text. It is timely to explore how one can contribute to the other.

IE has developed techniques to extract information based mainly on shallow syntactic and semantic analysis of texts. Corpus linguistics currently relies on at most shallow syntactic analysis to carry out automatic annotation of corpora, although there is growing interest in attempting to automate annotation at higher linguistic levels. IE concentrates on domain-specific texts of similar types, whereas corpus linguistics is often concerned with large designed collections of heterogeneous texts drawn from numerous domains.

The types of annotations that IE systems produce are of interest to corpus linguistics in that they offer a means of augmenting corpora (or sub-parts thereof) with shallow semantic information. Systems exist which allow querying of texts annotated by IE techniques. This holds out the hope that corpora annotated by IE techniques could be queried more flexibly, and in more informative ways, than at present.

Corpus linguistics has developed a battery of sophisticated statistical techniques, based on measures such as dispersion and association scores, that could contribute to IE tasks. Rule-based IE systems typically do not exploit large-scale corpora (as opposed to mere collections), even within a single domain, to help refine or develop rules, beyond inspection by the human linguist. They thus do not have the benefit of automatically derived corpus evidence to guide them.
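As a rough illustration of the kind of corpus evidence meant here, the sketch below computes one common association score (pointwise mutual information over adjacent word pairs) and one very simple dispersion measure (the share of documents in which a word occurs). The function names and the toy data are assumptions made for this example only; they are not drawn from any system discussed at the workshop.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information (log base 2) for each adjacent word pair.

    Illustrative only: probabilities are plain relative frequencies,
    with no smoothing or frequency threshold.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


def document_range(docs, word):
    """Share of documents containing the word: a crude dispersion measure."""
    return sum(word in doc for doc in docs) / len(docs)


# Toy data, invented for the example.
toy_tokens = ["a", "b", "a", "b"]
toy_docs = [["a", "b"], ["b"], ["c"]]
toy_pmi = pmi_scores(toy_tokens)
toy_range = document_range(toy_docs, "b")
```

A rule developer could, for instance, rank candidate trigger phrases by such scores before writing extraction patterns by hand. Real corpus work would normally use more robust measures and much larger samples.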

IE systems can offer a purpose-oriented view of a collection of data: advances have been made in enabling users to specify the kind of knowledge they wish to be extracted. Systems are able to process large amounts of data at high speed.

With a traditionally annotated corpus, there is really only one view: that of the original annotators. There has also been at least a tacit assumption that one applies a given technique or annotation to the entire corpus one is working on. For example, in building a treebank of partially analysed sentences, each sentence is analysed to some degree. However, it seems reasonable to suggest that a corpus might be only partially analysed, via IE techniques, so that certain sentences of the corpus receive no annotation at all. This makes it difficult to form any overall judgement, but one may ask whether an overall judgement makes sense in such a context. Typically, with an IE technique, one has some particular goal in mind and is interested in analysing only those parts of a collection that are relevant to that goal. Care would have to be taken in interpreting such results with respect to the overall corpus; nevertheless, we have something extra to interpret, no matter how partial, and moreover something that represents a particular view for a particular purpose.

There is thus an opportunity for IE techniques to offer corpus linguists customised, partial views of a corpus, even dynamically created ones, given fast computers and cheap storage.
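To make the idea of a partial, goal-specific view concrete, here is a minimal sketch in which a single extraction pattern produces stand-off annotations for only those sentences relevant to one goal, leaving all other sentences unannotated. The pattern, the `Annotation` record and the two example sentences are hypothetical, chosen purely for illustration; an actual IE system would use far richer analysis than one regular expression.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Stand-off annotation: points into a sentence, does not alter it."""
    sentence_id: int
    start: int
    end: int
    label: str
    text: str


# Hypothetical pattern for one extraction goal: management appointments.
APPOINTMENT = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) was appointed (?P<post>[a-z ]+)"
)


def partial_view(sentences, pattern):
    """Annotate only the sentences that match the pattern."""
    annotations = []
    for sid, sentence in enumerate(sentences):
        match = pattern.search(sentence)
        if match is None:
            continue  # irrelevant to this goal: sentence gets no annotation
        for label in pattern.groupindex:
            start, end = match.span(label)
            annotations.append(
                Annotation(sid, start, end, label, match.group(label))
            )
    return annotations


# Toy corpus, invented for the example.
corpus = [
    "Jane Smith was appointed finance director.",
    "The weather in Athens was fine.",
]
view = partial_view(corpus, APPOINTMENT)
```

Because the annotations are held apart from the text, several such views, each built for a different purpose, could coexist over the same corpus and be created on demand.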

A crucial point for discussion is that of standards: if we are to envisage using IE techniques to investigate corpora and to enrich the annotation of corpora, then it is important to start on a process of arriving at standards for the type of annotations (templates, etc.) produced by IE systems. This is not to say that all IE systems should be obliged to conform to some standard, but rather that, if corpus linguists are to benefit from more interestingly annotated corpora, then any IE system used to achieve such ends should ideally be capable of producing some kind of standardised representation in addition to its own private one. This will allow greater reuse of the resulting resources. Corpus linguists have already gone a long way down the standardisation road, and thus have much experience to offer the IE community. How this is achieved is a matter for discussion.

Key topics (indicative list)

Organising committee

John McNaught (UMIST, UK)
Bill Black (UMIST, UK)
Nicoletta Calzolari (ILC-CNR, Italy)
Luca Gilardoni (Quinary SpA, Italy)
Tony McEnery (University of Lancaster, UK)


Programme

Introduction to workshop (organisers)

Thierry Declerck & Günter Neumann, Language Technology Lab, DFKI GmbH, Saarbrücken, Germany
Using a parameterisable and domain-adaptive information extraction system for annotating large-scale corpora?

Andrea Setzer & Robert Gaizauskas, Dept of Computer Science, University of Sheffield, UK
Building a temporally annotated corpus for information extraction

Gökhan Tür, Dilek Z. Hakkani-Tür & Kemal Oflazer, Dept of Computer Engineering, Bilkent University, Ankara, Turkey
Name tagging using lexical, contextual and morphological information

Mark Stevenson & Robert Gaizauskas, Dept of Computer Science, University of Sheffield, UK
Improving named entity recognition using annotated corpora

Coffee break

Roman Yangarber & Ralph Grishman, Computer Science Dept, New York University, New York, USA
Extraction pattern discovery through corpus analysis

Jakub Zavrel, Departement Germaanse, CNTS/University of Antwerp, Antwerp, Belgium
Peter Berck & Willem Lavrijssen, Stichting Toepassing Inductieve Leertechnieken (STIL), Tilburg, The Netherlands
Information extraction by text classification: Corpus mining for features

Panel session

Important dates

Deadline for workshop abstract submission
22nd January 2000

Notification of acceptance
25th February 2000

Final version of paper for workshop proceedings
9th April 2000

Workshop
30th May 2000

Contact person for the workshop

John McNaught
Department of Language Engineering
PO Box 88
Sackville Street
Manchester M60 1QD

E-mail: jock@ccl.umist.ac.uk
Tel: +
Fax: +


An 800-word abstract in English should be submitted by e-mail to John McNaught (jock@ccl.umist.ac.uk) in plain ASCII text format. Each submission should show the title, author(s), affiliation(s), and the contact author's e-mail address, postal address, and telephone and fax numbers.

The final version should not be longer than 4,000 words or 10 A4 pages.

Instructions for formatting and presentation of the final version will be sent to authors upon notification of acceptance.

An accepted paper should not have been presented at another meeting.

Receipt of submissions will be acknowledged.

Technical support

Facilities for overhead projection and for SVGA data display will be available.

Workshop registration

The registration fee for the workshop is:

The fee includes a coffee break and the workshop proceedings.

Participation in the workshop is limited by the capacity of the venue. Requests for participation will be processed on a first-come, first-served basis. Registration is handled by the LREC Secretariat.

Conference information

General information on LREC: http://www.lrec-conf.org/index.html

Specific queries about the conference and registration for the workshop:

LREC Secretariat
6, Artemidos & Epidavrou Str
15125 Marousi

E-mail: LREC2000@ilsp.gr
Tel: +30 1 6800959
Fax: +30 1 6856794

Last revised 18th April 2000. © UMIST