Bootstrapping a database of German multi-word expressions


Alexander Geyken

Berlin-Brandenburgische Akademie der Wissenschaften, Jägerstr. 22/23, 10117 Berlin, www.dwds.de, geyken@bbaw.de




We pre-classified 32,000 entries from the Wörterbuch der deutschen Idiomatik (Schemann 1993) using an inductive description of POS sequences in conjunction with a Brill tagger trained on manually tagged idiomatic entries. This process assigned categories to 86% of entries with 88% accuracy. Further manual classification resulted in a database of multi-word expressions where each entry is associated with a sequence of POS-tag/token pairs. The second phase of our project, currently underway, addresses the association of a sequence of POS-tag/token pairs with a corpus example. To this end, we generate a weighted finite state transducer from the sequences for each entry and apply a finite state filter to the corpus. The filter will extract those sequences in the corpus that correspond to the longest match of the multi-word expression.
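The longest-match idea can be illustrated with a minimal sketch. It is not the weighted finite state transducer described above: it substitutes a plain linear scan over a tagged sentence, and the entry format (a list of POS-tag/token pairs, with `None` standing for any token), the STTS-style tags, and the names `matches_at`, `longest_matches`, `PATTERNS` and `SENTENCE` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical entry format: a sequence of (POS tag, token) pairs.
# A token of None is an assumption meaning "any token with this POS tag".
Pattern = List[Tuple[str, Optional[str]]]
# A tagged corpus sentence: a sequence of (token, POS tag) pairs.
Tagged = List[Tuple[str, str]]


def matches_at(pattern: Pattern, sentence: Tagged, start: int) -> bool:
    """Return True if the pattern matches the sentence beginning at index start."""
    if not pattern or start + len(pattern) > len(sentence):
        return False
    for (pos, tok), (s_tok, s_pos) in zip(pattern, sentence[start:]):
        if pos != s_pos:
            return False
        if tok is not None and tok.lower() != s_tok.lower():
            return False
    return True


def longest_matches(patterns: Dict[str, Pattern], sentence: Tagged) -> List[Tuple[str, int, int]]:
    """Scan left to right; at each position keep only the longest matching entry."""
    results = []
    i = 0
    while i < len(sentence):
        best = None
        for name, pattern in patterns.items():
            if matches_at(pattern, sentence, i):
                if best is None or len(pattern) > len(best[1]):
                    best = (name, pattern)
        if best is not None:
            # Record (entry name, start index, end index) and skip past the match.
            results.append((best[0], i, i + len(best[1])))
            i += len(best[1])
        else:
            i += 1
    return results


# Illustrative data only; the tags follow STTS conventions as an assumption.
PATTERNS: Dict[str, Pattern] = {
    "ins Gras": [("APPRART", "ins"), ("NN", "Gras")],
    "ins Gras beißen": [("APPRART", "ins"), ("NN", "Gras"), ("VVINF", "beißen")],
}
SENTENCE: Tagged = [
    ("Er", "PPER"), ("musste", "VMFIN"), ("ins", "APPRART"),
    ("Gras", "NN"), ("beißen", "VVINF"), (".", "$."),
]

print(longest_matches(PATTERNS, SENTENCE))
```

Both hypothetical entries match at the third token, but only the longer one is reported, which is the behaviour the filter is meant to have. A real transducer-based filter would additionally handle weights and discontinuous or inflected variants, which this sketch leaves out.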


multi-word expressions, collocations, database, acquisition, finite state filter

Language(s): German
Full Paper