LREC 2000 2nd International Conference on Language Resources & Evaluation


Title Semantico-syntactic Tagging of Very Large Corpora: the Case of Restoration of Nodes on the Underlying Level
Authors Hajičová Eva (Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 1180 Praha 1, Czechia)
Sgall Petr (Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 1180 Praha 1, Czechia)
Keywords Corpus, Deletions, Dependency, Syntax
Session Session WO2 - Treebanks
Full Paper, 18.pdf
Abstract The Prague Dependency Treebank has been conceived of as a semi-automatic three-layer annotation system, in which the layers of morphemic and 'analytic' (surface-syntactic) tagging are followed by the layer of tectogrammatical tree structures. Two types of deletions are recognized: (i) those licensed by the grammatical properties of the given sentence, and (ii) those possible only if the preceding context exhibits certain specific properties. Within group (i), either the position in the sentence structure is determined but its lexical setting is 'free' (as, e.g., with a deleted subject in Czech as a pro-drop language), or both the position and its 'filler' are determined. Group (ii) reflects the typological differences between English and Czech; the rich morphemics of the latter is more favorable for deletions. Several steps of the tagging procedure are carried out automatically, but most of the restoration of deleted nodes still has to be done "manually". If nodes depending on the node being restored are also deleted, they are restored only if they function as arguments or obligatory adjuncts. The large set of annotated utterances will make it possible to check and amend the present results, also by applying statistical methods. Theoretical linguistics will be able to check its descriptive framework; the degree of automation of the procedure will then be raised, and the treebank will be useful for a wide variety of tasks in language processing.
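The restoration rule for dependents stated in the abstract can be illustrated with a minimal sketch. It is not the treebank's actual procedure or tag set: the `Node` class, the role labels, and the function `restore_subtree` are hypothetical names introduced only for illustration, and the example sentence is invented. The sketch assumes a candidate underlying subtree in which each node is marked as deleted or present on the surface.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical role labels (an assumption, not the treebank's tag set).
RESTORABLE_ROLES = {"argument", "obligatory_adjunct"}


@dataclass
class Node:
    lemma: str
    role: str                 # e.g. "root", "argument", "obligatory_adjunct", "free_adjunct"
    deleted: bool = False     # True if the node is absent from the surface sentence
    children: List["Node"] = field(default_factory=list)


def restore_subtree(node: Node) -> Node:
    """Return a copy of a restored node, keeping a deleted dependent only
    if it functions as an argument or obligatory adjunct."""
    kept = []
    for child in node.children:
        if not child.deleted:
            # Dependents present on the surface are kept unchanged.
            kept.append(child)
        elif child.role in RESTORABLE_ROLES:
            # A deleted argument or obligatory adjunct is restored in turn.
            kept.append(restore_subtree(child))
        # Deleted free adjuncts are skipped, i.e. not restored.
    return Node(node.lemma, node.role, node.deleted, kept)


# Invented example: a deleted verb with three deleted dependents.
head = Node("give", "root", True, [
    Node("Mary", "argument", True),
    Node("book", "argument", True),
    Node("yesterday", "free_adjunct", True),
])
restored = restore_subtree(head)
print([child.lemma for child in restored.children])
```

In this sketch the free adjunct is dropped while both arguments are kept; how a role is assigned to each dependent is left open here, since the abstract notes that most of the restoration is still done manually.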