Held in conjunction with the
Second International Conference on Language Resources and Evaluation
Tuesday, 30th May 2000
This workshop seeks to explore how Information Extraction and Corpus Linguistics can each benefit from the techniques of the other.
The goals of information extraction and of corpus linguistics have thus far had little in common. However, both are concerned with processing large bodies of text. It is timely to explore how one can contribute to the other.
IE has developed techniques to extract information based mainly on shallow syntactic and semantic analysis of texts. Corpus linguistics currently relies on, at most, shallow syntactic analysis to carry out automatic annotation of corpora, although there is growing interest in automating annotation at higher linguistic levels. IE concentrates on domain-specific texts of similar types, whereas corpus linguistics is often concerned with large, designed collections of heterogeneous texts drawn from numerous domains.
The types of annotations that IE systems produce are of interest to corpus linguistics in that they offer a means of augmenting corpora (or sub-parts thereof) with shallow semantic information. Systems exist which allow querying of texts annotated by IE techniques. This holds out the hope that corpora annotated according to IE techniques could be queried more flexibly, and in different and more informative ways, than at present.
Corpus linguistics has developed a battery of sophisticated statistical techniques, based on measures such as dispersion and association scores, that could contribute to IE tasks. Rule-based IE systems typically do not exploit large-scale corpora (as opposed to mere collections), even within a single domain, to help in the development or refinement of rules, beyond inspection by the human linguist. They thus lack the benefit of automatically derived corpus evidence to guide them.
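As an illustration only, one widely used association score is pointwise mutual information (PMI), which compares how often two words occur together with how often they would be expected to co-occur by chance. The sketch below computes PMI for adjacent word pairs in a tokenised collection. The function name pmi_scores, the min_count parameter and the restriction to adjacent pairs are assumptions made for this example; they are not drawn from any particular IE or corpus system.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=1):
    """Return PMI (log base 2) for each adjacent word pair.

    `sentences` is a list of token lists. Pairs seen fewer than
    `min_count` times are skipped (hypothetical threshold).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # Adjacent pairs only; pairs never cross sentence boundaries.
        bigrams.update(zip(tokens, tokens[1:]))

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy data, invented for the example.
toy_corpus = [
    ["interest", "rate", "rose"],
    ["interest", "rate", "fell"],
    ["the", "rate", "rose"],
]
toy_scores = pmi_scores(toy_corpus)
```

Pairs with high scores in a large domain corpus could then be put before the rule writer as candidate patterns, rather than relying on inspection alone.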
IE systems can offer a purpose-oriented view of a collection of data: advances have been made in enabling users to specify the kind of knowledge they wish to be extracted. Systems are able to process large amounts of data at high speed.
With a traditionally annotated corpus, there is really only one view: that of the original annotators. There has also been at least a tacit assumption that one applies a given technique or annotation to the entire corpus one is working on. For example, in building a treebank of partially analysed sentences, every sentence is analysed to some degree. However, it seems reasonable to suggest that a corpus might be only partially analysed via IE techniques, so that certain sentences of the corpus receive no annotation at all. This makes it difficult to form any overall judgement, but one may ask whether an overall judgement makes sense in such a context. Typically, with an IE technique, one has a particular goal in mind and is interested only in analysing those parts of a collection that are relevant to that goal. Care would have to be taken in interpreting such results with respect to the overall corpus. Nevertheless, we have something extra to interpret, no matter how partial, and moreover something that represents a particular view for a particular purpose.
There is thus an opportunity for IE techniques to offer corpus linguists customised, partial views of a corpus, even dynamically created ones, given fast computers and cheap storage.
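A minimal sketch of such a partial, dynamically created view follows. It assumes a single hypothetical extraction rule, written here as a regular expression, and records stand-off annotations only for the sentences that the rule matches; all other sentences receive no annotation. The Annotation record, the ACQUISITION pattern and the partial_view function are invented for illustration and do not describe any existing IE system.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    sentence_id: int  # index of the sentence within the corpus
    start: int        # character offset where the matched span starts
    end: int          # character offset just past the matched span
    label: str        # hypothetical annotation type


# Hypothetical rule: a capitalised name, "acquired", another capitalised name.
ACQUISITION = re.compile(r"\b([A-Z]\w+) acquired ([A-Z]\w+)\b")


def partial_view(corpus, pattern=ACQUISITION, label="ACQUISITION"):
    """Build stand-off annotations for the sentences matching `pattern`.

    Sentences without a match contribute nothing, so the result is a
    purpose-specific, partial view over the corpus.
    """
    view = []
    for sentence_id, sentence in enumerate(corpus):
        for match in pattern.finditer(sentence):
            view.append(
                Annotation(sentence_id, match.start(), match.end(), label)
            )
    return view


# Toy data, invented for the example.
toy_corpus = [
    "Acme acquired Bolt in 1999.",
    "The weather was mild.",
    "Analysts said Corvid acquired Dunlin.",
]
toy_view = partial_view(toy_corpus)
```

Because the annotations are kept apart from the text, a different rule can produce a different view of the same corpus without disturbing earlier ones.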
A crucial point for discussion is that of standards. If we are to envisage using IE techniques to investigate corpora and to enrich their annotation, then it is important to begin the process of arriving at standards for the types of annotation (templates, etc.) produced by IE systems. This is not to say that all IE systems should be obliged to conform to some standard. Rather, if corpus linguists are to benefit from more interestingly annotated corpora, then any IE system used to achieve such ends should ideally be capable of producing some kind of standardised representation in addition to its own private one. This will allow greater reuse of the resulting resources. Corpus linguists have already gone a long way down the standardisation road, and thus have much to offer the IE community in terms of experience. How this is to be achieved is a matter for discussion.
John McNaught (UMIST, UK)
Bill Black (UMIST, UK)
Nicoletta Calzolari (ILC-CNR, Italy)
Luca Gilardoni (Quinary SpA, Italy)
Tony McEnery (University of Lancaster, UK)
Department of Language Engineering
PO Box 88
Manchester M60 1QD
An 800-word abstract in English should be submitted by e-mail to McNaught (firstname.lastname@example.org), in plain ASCII text format. Each submission should show: title; author(s); affiliation(s); and the contact author's e-mail address, postal address, and telephone and fax numbers.
The final version should not be longer than 4,000 words or 10 A4 pages.
Instructions for formatting and presentation of the final version will be sent to authors upon notification of acceptance.
An accepted paper should not have been presented at another meeting.
Receipt of submissions will be acknowledged.
Facilities for overhead projection and for SVGA data display will be available.
The registration fee for the workshop is:
The fee includes a coffee break and the workshop proceedings.
Participation in the workshop is limited by the venue. Requests for participation will be processed on a first-come, first-served basis. Registration is handled by the LREC Secretariat.
General information on LREC: http://www.lrec-conf.org/index.html
Specific queries about the conference and registration for the workshop:
6, Artemidos & Epidavrou Str
Tel: +30 1 6800959
Fax: +30 1 6856794