TTS – A Treebank Tool Suite
Aoife Cahill (Dublin City University, Glasnevin, Dublin 9, Ireland)
Josef van Genabith (Dublin City University, Glasnevin, Dublin 9, Ireland)
WP4: Corpus Annotation
Treebanks are important resources in descriptive, theoretical and computational linguistic research, development and teaching. This paper presents a treebank tool suite (TTS) for, and derived from, the Penn-II treebank resource (Marcus et al., 1993). The tools include treebank inspection and viewing options that support search for CF-PSG rule tokens extracted from the treebank, graphical display of the complete trees containing a rule instance, display of the subtree rooted by the rule instance, and display of the yield of that subtree (with or without context). The search can be further restricted by constraining the yield to contain particular strings. Rules can be ordered by frequency, and the user can set frequency thresholds. To process new text, the tool suite provides a PCFG chart parser (based on the CYK algorithm) operating on CFGs extracted from the treebank following the method of Charniak (1996), as well as an HMM bi-/trigram tagger trained on the tagged version of the treebank resource. The system is implemented in Java and Perl. We employ the InterArbora module, based on the Thistle display engine (LTG, 2001), as our tree grapher.
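To make the rule extraction step concrete, the following is a minimal sketch, not the TTS implementation (which is written in Java and Perl). It reads Penn-style bracketed trees, counts one CF-PSG rule token per internal node, orders rule types by frequency above a user-set threshold, and estimates PCFG rule probabilities by relative frequency, in the spirit of the treebank-grammar method of Charniak (1996). All function names (`parse_bracketed`, `extract_rules`, `rules_by_frequency`, `rule_probabilities`) and the two toy trees are hypothetical. The sketch assumes each tree is a single well-formed bracketing without the empty outer bracket found in the distributed Penn-II files, and it skips preterminal-to-word rules.

```python
from collections import Counter, defaultdict


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word at the leaf level
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def extract_rules(tree, counts):
    """Count one rule token (LHS, RHS) per internal node of the tree."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return  # preterminal over a word: lexical rules are skipped here
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            extract_rules(child, counts)


def rules_by_frequency(counts, min_count=1):
    """Return rule types with count >= min_count, most frequent first."""
    kept = [(rule, n) for rule, n in counts.items() if n >= min_count]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


def rule_probabilities(counts):
    """Relative-frequency estimate: count(A -> beta) / count(A)."""
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Two toy trees standing in for treebank input (illustrative data only).
trees = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (NNP Kim)) (VP (VBD saw) (NP (DT the) (NN cat))))",
]

counts = Counter()
for bracketing in trees:
    extract_rules(parse_bracketed(bracketing), counts)

frequent = rules_by_frequency(counts, min_count=2)
probs = rule_probabilities(counts)

for (lhs, rhs), n in rules_by_frequency(counts):
    print(f"{n:3d}  {probs[(lhs, rhs)]:.3f}  {lhs} -> {' '.join(rhs)}")
```

Because the probabilities are normalised per left-hand-side category, the rules expanding any one category sum to 1, which is the property a PCFG parser relies on. Here the frequency threshold only filters what is listed; it does not change the probability estimates.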