Design, Collection, and Annotation of a Romanian Speech Database

Marian Boldea, Cosmin Munteanu, Alin Doroga

Undertaken as part of the BABEL project, the design, collection, and annotation of a Romanian speech database is now in its final stage. This paper gives a general overview of the whole process.

Intended mainly for training, testing, and evaluating speaker-independent continuous speech recognition systems, the database was designed starting from the existing EUROM-1 database in order to comply with the pre-normalization and standards objectives of the COPERNICUS programme, with special emphasis on: an integrated (re)design of the various components found in EUROM-1 (read passages, filler sentences, numbers, CVC words) so that they meet their aims more systematically; a speaker population of at least 60 persons, with a uniform sex and age group distribution, extensible beyond this minimum, and structured in Many Talkers, Few Talkers, and Very Few Talkers sets similar to EUROM-1; and an extension with semispontaneous material.

The 40 prompt passages in the English version of EUROM-1, translated and adapted, were grouped into 10 clusters by a heuristic algorithm that tries to even out the phonemic distributions across clusters. Each cluster was assigned to a certain number (minimum three) of speakers of each sex, so that an extensible speaker population could be recorded in 20-speaker increments.
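The heuristic itself is not specified here, so the following is only a minimal sketch of one way such a grouping could work, under stated assumptions: each passage is available as a list of phoneme symbols, passages are placed longest first, and each one goes to the cluster that keeps the per-phoneme spread (maximum minus minimum count across clusters) smallest. The function names and the imbalance measure are hypothetical, not the authors' method.

```python
from collections import Counter

def imbalance(cluster_counts, phonemes):
    """Sum over phonemes of (max - min) count across clusters."""
    return sum(
        max(c[p] for c in cluster_counts) - min(c[p] for c in cluster_counts)
        for p in phonemes
    )

def group_passages(passages, n_clusters, per_cluster):
    """Greedily assign passages (lists of phoneme symbols) to clusters.

    Returns a list of clusters, each a list of passage indices.
    Assumed heuristic for illustration only.
    """
    counts = [Counter(p) for p in passages]
    phonemes = sorted(set().union(*counts))
    # Place long passages first; they are the hardest to balance later.
    order = sorted(range(len(passages)), key=lambda i: -len(passages[i]))
    clusters = [[] for _ in range(n_clusters)]
    totals = [Counter() for _ in range(n_clusters)]
    for i in order:
        best_k, best_score = None, None
        for k in range(n_clusters):
            if len(clusters[k]) >= per_cluster:
                continue  # this cluster is already full
            # Score the distribution as if passage i were added to cluster k.
            trial = totals[:k] + [totals[k] + counts[i]] + totals[k + 1:]
            score = imbalance(trial, phonemes)
            if best_score is None or score < best_score:
                best_k, best_score = k, score
        if best_k is None:
            raise ValueError("more passages than cluster slots")
        clusters[best_k].append(i)
        totals[best_k] += counts[i]
    return clusters
```

With the figures given above, the call would be `group_passages(passages, 10, 4)`, that is, 40 passages in 10 clusters of four.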

Because some phonemes were poorly represented, two or three filler sentences were added to each cluster so that a minimum number of occurrences (7) per extended cluster could be expected for all phonemes.

The numbers part, which in EUROM-1 consists of the same 100 numbers for all languages, was reduced to a set of 26 numbers checked to satisfy the phonotactic coverage originally intended by EUROM-1, and the CVC words, in isolation and in context, were adapted to the Romanian phonological system.

Four common phonemically compact sentences were added to be labeled by hand and used to initialize phoneme HMMs for an automatic alignment system, and individual sentences (3-7 per speaker) were automatically selected by a greedy algorithm from a manually pre-processed text corpus to improve diphone coverage.
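The greedy selection is not detailed further, so the sketch below only illustrates the general idea under assumptions: each candidate sentence is a sequence of phoneme symbols, a diphone is taken to be any pair of adjacent phonemes, and at every step the sentence contributing the most not-yet-covered diphones is picked. The function and parameter names are hypothetical.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs in one sentence."""
    return set(zip(phonemes, phonemes[1:]))

def select_sentences(corpus, covered, n_select):
    """Greedily pick up to n_select sentences from corpus.

    corpus   : list of sentences, each a sequence of phoneme symbols
    covered  : diphones already covered by previously chosen material
    Returns (indices of chosen sentences, updated set of covered diphones).
    """
    covered = set(covered)
    remaining = dict(enumerate(corpus))
    chosen = []
    while remaining and len(chosen) < n_select:
        # Sentence that adds the largest number of new diphones.
        best = max(remaining, key=lambda i: len(diphones(remaining[i]) - covered))
        gain = diphones(remaining[best]) - covered
        if not gain:
            break  # nothing left in the corpus improves coverage
        covered |= gain
        chosen.append(best)
        del remaining[best]
    return chosen, covered
```

Calling this once per speaker, and passing the coverage accumulated so far as `covered`, would give each speaker a different set of individual sentences.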

Additional semispontaneous material was planned to be collected by requests for very simple personal information (name - spoken and spelled; ID - two letters and six digits; telephone number; birth date; address), and a reading of the Romanian alphabet was included for comparisons with name spelling and ID letters pronunciation.

Spread over more than one year, the recordings took place in a soundproof room using a PC-compatible computer placed in an adjacent recording control room. The PC was configured as a SAM workstation using an OROS AU-21 board and running the EUROPEC data collection package. A SONY ECM-44B condenser microphone placed about 25 cm from the speaker's mouth, 30 degrees off axis, and connected to the OROS board through a preamplifier was used for recordings, and an operator-controlled intercom system was used to instruct the speakers.

For every speaker, the recordings started with the semispontaneous items and the alphabet, which were recorded twice: the first time to get the speaker acquainted with the operating procedures, and only the second to be preserved. These were followed by the read material in the sequence: passages; filler, individual, and phonemically compact sentences; numbers; and CVC words.

After recording, every file was checked for DC bias, signal clipping, signal and noise levels, signal-to-noise ratio, and mains-related noise components, in order to maintain consistent quality throughout the recording period.
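How these measures were computed is not stated, so the following is a rough sketch under assumptions: samples are 16-bit integers, a noise-only stretch of the file is supplied separately, and the mains component is estimated as the amplitude of a single DFT bin at 50 Hz (the European mains frequency). All names, parameters, and defaults are hypothetical.

```python
import math

def tone_amplitude(samples, rate, freq):
    """Amplitude of the sinusoidal component at freq Hz (single-bin DFT)."""
    w = 2 * math.pi * freq / rate
    re = sum(s * math.cos(w * n) for n, s in enumerate(samples))
    im = sum(s * math.sin(w * n) for n, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)

def rms(values, offset=0.0):
    """Root mean square of values after subtracting a constant offset."""
    return math.sqrt(sum((v - offset) ** 2 for v in values) / len(values))

def check_recording(samples, noise, rate, mains_hz=50.0, full_scale=32767):
    """Basic quality measures for one file.

    samples : all samples of the file
    noise   : samples from a stretch assumed to contain no speech
    """
    dc = sum(samples) / len(samples)
    sig_rms = rms(samples, dc)
    noise_rms = rms(noise, sum(noise) / len(noise))
    if sig_rms > 0 and noise_rms > 0:
        snr_db = 20 * math.log10(sig_rms / noise_rms)
    else:
        snr_db = float("inf")  # degenerate case: silent signal or noise
    return {
        "dc_bias": dc,
        "clipped": sum(1 for s in samples if abs(s) >= full_scale),
        "signal_rms": sig_rms,
        "noise_rms": noise_rms,
        "snr_db": snr_db,
        "mains_amplitude": tone_amplitude(samples, rate, mains_hz),
    }
```

Acceptance thresholds are deliberately left out, since none are given in the text; in practice each value would be compared against limits chosen for the recording setup.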

Recordings stopped at 100 speakers, including ten Few Talkers and two Very Few Talkers, and three CD-ROMs were written holding all the collected data.

Still in progress, the annotation consists mainly of transcribing the signal files at the broad phonetic level, automatically aligning the phoneme strings with the signals based on Viterbi segmentation, and manually checking and correcting the automatically aligned labels.

The transcription was done by listening to every signal file and, where considered necessary, examining the signal waveform, using the standard Romanian phoneme inventory and taking into account connected speech phenomena (assimilations, elisions, epentheses).

The phonemically compact sentences were labeled manually at the broad phonetic level based on waveform and spectrogram visualization, and used to initialize gender-dependent phoneme HMMs, subsequently trained using a concatenated Baum-Welch algorithm and all transcribed files.

The results of the Viterbi segmentation used for automatic alignment are checked and corrected by hand, based on waveform, spectrogram, and label visualization and on selective listening to the signal, and label files are generated in SAM format.
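As an illustration of the Viterbi segmentation step, here is a strongly simplified sketch, not the system described above. It assumes one state per phoneme, no transition probabilities, and a precomputed table `loglik[t][j]` giving the log-likelihood of frame t under the j-th phoneme of the transcription; a real aligner would obtain these scores from the trained phoneme HMMs. The function name and data layout are hypothetical.

```python
def force_align(loglik):
    """Align frames to a known phoneme sequence by dynamic programming.

    loglik[t][j] : log-likelihood of frame t under the j-th phoneme
    Returns the start frame of each phoneme (left to right, each phoneme
    covering at least one frame).
    """
    T, J = len(loglik), len(loglik[0])
    if T < J:
        raise ValueError("need at least one frame per phoneme")
    NEG = float("-inf")
    score = [[NEG] * J for _ in range(T)]
    entered = [[False] * J for _ in range(T)]  # True: phoneme j begins at frame t
    score[0][0] = loglik[0][0]
    for t in range(1, T):
        for j in range(min(t + 1, J)):
            stay = score[t - 1][j]                       # remain in phoneme j
            move = score[t - 1][j - 1] if j > 0 else NEG  # advance from j-1
            entered[t][j] = move > stay
            score[t][j] = max(stay, move) + loglik[t][j]
    # Trace back from the last frame of the last phoneme.
    starts = [0] * J
    j = J - 1
    for t in range(T - 1, 0, -1):
        if entered[t][j]:
            starts[j] = t
            j -= 1
    return starts
```

The returned start frames, multiplied by the frame shift, would give the label boundaries that are then checked and corrected by hand.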

So far, four passages and the associated filler sentences from all 100 speakers have been transcribed and automatically aligned. A quantitative evaluation of the annotation procedure will be conducted as soon as the manual verification and correction is completed for this part of the database, and its extension to the rest of the collected data is intended for the future.
