On the Spoken Corpus of the Budapest Sociolinguistic Interview

Tamas Varadi

The Budapest Sociolinguistic Interview (BSI) project is a long-term sociolinguistic project aiming to provide solid empirical data about the language varieties spoken in Budapest. About 250-300 hours of tape-recorded data were collected by means of a carefully compiled sociolinguistic interview, which was administered to a 250-strong representative sample of Budapest speakers.

The research topics ranged from phonetic/phonological to lexical questions and were investigated with the help of a battery of tasks, from oral sentence completion and reading at normal and fast speed to at least half an hour of guided conversation. The project cannot afford to undertake a minute analysis or even a phonetic transcription of such a mass of acoustic material, but the handful of phonetic issues under study (consonant deletion, vowel length, compensatory lengthening, etc.) are coded in the orthographic transcript.

The tape recording of one complete interview has recently been digitized, and a sample CD has been prepared that includes a common HTML graphical interface to the transcript and the sound files. The sound data are aligned with the transcript at the level of the functional units of the interview, which may range from minimal pairs or individual sentences to a whole conversation module. (The talk will be accompanied by a demonstration of the CD.)

The transcription and coding of the whole material are still at an early stage, but in an effort to provide safe storage of, and access to, the taped material, we are planning to go ahead with digitizing all of it. This work is expected to be complete in two years' time, yielding a large-scale, demographically representative spoken corpus of the Budapest variety of Hungarian in digitized form. This should serve as a valuable resource for any future projects that seek to target representative spontaneous speech.

