Automatically selecting domain markers for terminology extraction


Jorge Vivaldi (1), Horacio Rodríguez (2)

(1) Institute for Applied Linguistics, Universitat Pompeu Fabra, La Rambla 30-32, 08002 Barcelona, Spain, jorge.vivaldi@upf.edu; (2) Software Department, Universitat Politècnica de Catalunya, c/ Jordi Girona 31, 08034 Barcelona, Spain, horacio@lsi.upc.es




Some approaches to automatic terminology extraction from corpora use existing semantic resources to guide the detection of terms. Most of these systems exploit specialised resources, such as UMLS in the medical domain, while a few try to take advantage of general-purpose semantic resources, such as EuroWordNet (EWN). Because term extraction is clearly domain-dependent, using a general-purpose resource that lacks specific domain information requires a way of attaching domain information to the units of the resource. For large resources, it is desirable that this semantic enrichment be carried out automatically. Given a specific domain, our proposal aims to detect in EWN those units that can be considered domain markers (DM). We define a DM as an EWN entry whose attached strings belong to the domain, as do the variants of all its descendants through the hyponymy relation. The procedure we propose in this paper is fully automatic and, a priori, domain-independent. The only external knowledge it uses is a set of terms (an external vocabulary), each of which is considered to have at least one sense belonging to the domain.
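The DM definition can be illustrated with a minimal sketch. It assumes a strict reading of the definition (every variant of the entry and of all its hyponymy descendants must appear in the domain vocabulary) and uses a hypothetical toy hierarchy in place of EWN; the names `HYPONYMS`, `VARIANTS`, `DOMAIN_TERMS`, `is_domain_marker` and `domain_markers` are illustrative only and are not taken from the paper or from any EWN interface.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy standing in for EWN: entry id -> direct hyponyms.
HYPONYMS: Dict[str, List[str]] = {
    "entity": ["disease", "bank"],
    "disease": ["infection", "cancer"],
    "infection": ["hepatitis"],
    "cancer": [],
    "hepatitis": [],
    "bank": [],
}

# Hypothetical variants (strings) attached to each entry.
VARIANTS: Dict[str, List[str]] = {
    "entity": ["entity"],
    "disease": ["disease", "illness"],
    "infection": ["infection"],
    "cancer": ["cancer"],
    "hepatitis": ["hepatitis"],
    "bank": ["bank"],
}

# Assumed external vocabulary of terms for a medical domain.
DOMAIN_TERMS: Set[str] = {"disease", "illness", "infection", "cancer", "hepatitis"}


def is_domain_marker(
    entry: str,
    hyponyms: Dict[str, List[str]],
    variants: Dict[str, List[str]],
    domain_terms: Set[str],
) -> bool:
    """Return True if the entry and all its hyponymy descendants
    have only variants found in the domain vocabulary (strict reading)."""
    stack = [entry]
    seen: Set[str] = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # One out-of-domain variant anywhere in the subtree disqualifies the entry.
        if not all(v in domain_terms for v in variants.get(node, [])):
            return False
        stack.extend(hyponyms.get(node, []))
    return True


def domain_markers(
    hyponyms: Dict[str, List[str]],
    variants: Dict[str, List[str]],
    domain_terms: Set[str],
) -> Set[str]:
    """Collect every entry that satisfies the strict DM test."""
    return {
        entry
        for entry in hyponyms
        if is_domain_marker(entry, hyponyms, variants, domain_terms)
    }


if __name__ == "__main__":
    print(sorted(domain_markers(HYPONYMS, VARIANTS, DOMAIN_TERMS)))
```

In this toy data, "entity" is rejected because its subtree contains "bank", while "disease" and its descendants pass the test. A practical procedure would likely need a tolerance threshold rather than a strict all-or-nothing test, since an external vocabulary rarely covers every variant.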


Terminology, term extraction, domain marker detection


