The DASL Project: A Case Study in Data Re-Annotation and Re-Use
Christopher Cieri (University of Pennsylvania and Linguistic Data Consortium, 3615 Market Street, Philadelphia, PA 19104-2608, U.S.A.)
Stephanie Strassel (University of Pennsylvania and Linguistic Data Consortium, 3615 Market Street, Philadelphia, PA 19104-2608, U.S.A.)
SP2: Speech Varieties And Multilingual ASR
It is well known and often repeated that publicly available digital data encourages basic and collaborative research, including the comparison of results across studies, the measurement of inter-annotator consistency and the use of stable data as a benchmark against which to compare new models and methodologies. Instances of such re-use abound. The re-use and re-annotation of the Switchboard and TDT corpora were described in detail at LREC 2000 (Graff and Bird 2000). Unfortunately, very few studies have focused on the issues surrounding re-use and re-annotation of data. The LDC project to develop Data and Annotations for Sociolinguists (DASL) encourages data sharing and the re-annotation and re-use of published data as an important complement to firsthand fieldwork. DASL annotators use a tool, developed for the project, that gives linguists access to the project's four corpora via the Internet and allows simultaneous annotation at multiple sites. In addition to the empirical study of linguistic variation among the speakers represented, this project will address methodological issues in corpus re-use and in team-based annotation of linguistic data. The paper will describe the tools, data and data formats developed for DASL, outline the challenges we have faced in re-annotating the data using a team approach and summarize the results to date.