Within the framework of the EC-funded SpeechDat (II) project (LE2-4001), a large telephone speech corpus has been collected and processed. Recording was done using an ISDN telephone interface, yielding 8 kHz, 8 bit/sample A-law coded signals. The corpus contains the speech of 4000 speakers (about half male and half female). Equivalent corpora have been collected for other European languages.
The corpora are designed to support the creation of voice-driven teleservices. The callers spoke 40 items, comprising isolated and connected digits, natural numbers, money amounts, spellings, time and date phrases, confirmations/rejections, forenames and surnames, city names, company names, common application words, application words in phrases, and phonetically rich sentences. Most items are read; some are spoken spontaneously. The recordings come with extensive, standardised documentation. All speech is carefully transcribed at the orthographic level; in addition, a number of clearly audible non-speech events are included in the transcription. The age and regional background of the speakers are also provided. A pronunciation dictionary is included, containing all words that occur in the corpus, each with a corresponding SAMPA broad-class phonemic transcription. The data files are formatted according to the ESPRIT Project SAM standards.
Each call in the database consists of a total of 40 items:
| Items | Corpus contents | Group |
|---|---|---|
| 3 | application words | |
| 1 | sequence of 10 isolated digits | |
| 1 | sheet number (6 digits) | 4 connected digits |
| 1 | telephone number (9-11 digits) | 4 connected digits |
| 1 | credit card number (14-16 digits) | 4 connected digits |
| 1 | PIN code (6 digits) (set of 150) | 4 connected digits |
| 1 | spontaneous date, e.g. birthday | 3 dates |
| 1 | prompted date, word style | 3 dates |
| 1 | relative and general date exp. | 3 dates |
| 1 | word spotting phrase using an application word (embedded) | |
| 1 | isolated digit | |
| 1 | spelling of surname (same as O1) | 3 spelled words |
| 1 | spelling of direct. city name (O3) | 3 spelled words |
| 1 | real/artificial for coverage (letter sequences) | 3 spelled words |
| 1 | currency money amount | |
| 1 | natural number | |
| 1 | surname (set of 500) | |
| 1 | city of growing up (spontaneous) | |
| 1 | most frequent cities (set of 500) | |
| 1 | most frequent company/agency (set of 500) | |
| 1 | forename surname (set of 150) | |
| 1 | predominantly yes question | 2 questions, including fuzzy yes/no |
| 1 | predominantly no question | 2 questions, including fuzzy yes/no |
| 9 | phonetically rich sentences | |
| 1 | time of day (spontaneous) | 2 time phrases |
| 1 | time phrase (word style) | 2 time phrases |
| 4 | phonetically rich words | |
We have grouped the regions of Spain into five phonetically relevant regions, defined on the basis of dialectal and bilingual criteria. The total number of speakers is 4000, about half male and half female. The map shows the defined dialectal regions, the places where speakers were recruited, and the total number of speakers recruited in each region.
The distribution of speakers over age groups and sexes is shown in the following table. The documentation file SESSION.TBL contains information about the accent region, age and sex of the speakers.
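Number of speakers per age group:

| Age group | Male | Female | Total |
|---|---|---|---|
| under 16 | 15 | 27 | 42 |
| 16-30 | 1093 | 1141 | 2234 |
| 31-45 | 404 | 440 | 844 |
| 46-60 | 361 | 403 | 764 |
| over 60 | 66 | 50 | 116 |
| Total | 2061 | 1939 | |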
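The figures in the age table can be cross-checked with a few lines of Python. This is only a sketch: the dictionary `AGE_TABLE` and the function `column_totals` are names introduced here for illustration, and the numbers are copied from the table in the order printed (Male, Fem.). Summed this way, the per-row totals add up to the stated 4000 speakers, but the column sums come out as 1939 and 2061, the reverse of the printed Total row, so the column labels or the totals may be transposed in the table.

```python
# Speaker counts copied from the age table, as (male, female) per age group.
AGE_TABLE = {
    "under 16": (15, 27),
    "16-30": (1093, 1141),
    "31-45": (404, 440),
    "46-60": (361, 403),
    "over 60": (66, 50),
}


def column_totals(table):
    """Return (male_total, female_total, grand_total) for the table."""
    male = sum(m for m, _ in table.values())
    female = sum(f for _, f in table.values())
    return male, female, male + female


if __name__ == "__main__":
    male, female, total = column_totals(AGE_TABLE)
    print(f"male={male} female={female} total={total}")
```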
Speech files are stored as sequences of 8-bit 8 kHz A-law uncompressed speech samples (CCITT G.711 recommendation). Each prompted utterance is stored within a separate file. Each speech file has an accompanying ASCII SAM label file.
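As an illustration of the storage format, the sketch below expands A-law bytes into 16-bit linear PCM values following the usual G.711 expansion rule. It assumes the speech files are headerless (raw samples only, with all metadata in the label file); the function names are introduced here for illustration and are not part of the corpus tools.

```python
def alaw_to_linear(byte: int) -> int:
    """Decode one G.711 A-law byte into a signed 16-bit linear sample."""
    byte ^= 0x55                      # undo the even-bit inversion of A-law
    mantissa = (byte & 0x0F) << 4     # 4-bit mantissa
    segment = (byte & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # After the XOR, a set top bit marks a positive sample.
    return magnitude if byte & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a whole buffer of A-law bytes."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes, sample_rate: int = 8000) -> float:
    """One byte per sample at 8 kHz, so duration is length / rate."""
    return len(data) / sample_rate


def load_speech_file(path: str) -> list[int]:
    """Read a raw A-law file (assumed headerless) and decode it."""
    with open(path, "rb") as handle:
        return decode_alaw(handle.read())
```

Because each sample occupies exactly one byte, a file of 16000 bytes corresponds to two seconds of speech.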
Recordings took place at UPC. Two recording platforms were used simultaneously. The main characteristics of each recording platform are:
- Interface: ISDN basic access (BRI).
- Board: AVM-ISDN-A1.
- Computer: Pentium PC at 120 MHz, 32 MB RAM, 4 GB SCSI hard disk, PCI network card.
- Operating system: Windows 95.
- Programming interface: COMMON-ISDN-API Version 2.0 (CAPI 2.0).
- Software: application software written in C (UPC ADA program).
- Lines: 2.
The transcription included in this database is an orthographic, lexical transcription with a few details that represent audible acoustic events (speech and non-speech) present in the corresponding waveform files. The extra marks contained in the transcription aid in interpreting the text form of the utterance. Transcriptions were made in two passes: a first pass in which the words are transcribed, and a second pass in which the additional details are added.
Extra marks indicate mispronunciations, truncations, unintelligible words and extra noises. The symbols for extra noises are:
- [fil]: Filled pause. These sounds can well be modelled in a filled pause model in speech recognisers. Examples of filled pauses: uh, um, er, ah, mm.
- [spk]: Speaker noise. All kinds of sounds and noises made by the calling speaker that are not part of the prompted text, e.g. lip smack, cough, grunt, throat clear, tongue click, loud breath, laugh, loud sigh.
- [sta]: Stationary noise. This category contains background noise that is not intermittent and has a more or less stable amplitude spectrum. Examples: car noise, road noise, channel noise, GSM noise, voice babble (cocktail-party noise), public place background noise, street noise.
- [int]: Intermittent noise. This category contains noises of an intermittent nature. These noises typically occur only once (like a door slam), or have pauses between them (like phone ringing), or change their colour over time (like music). Examples: music, background speech, baby crying, phone ringing, door slam, doorbell, paper rustle, cross talk.
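The four noise symbols listed above are easy to separate from the spoken words when preparing training text. The following is a minimal sketch: the function name and the example transcription are invented for illustration, and only the four noise symbols are handled, since the marks for mispronunciations, truncations and unintelligible words are not specified in this description.

```python
import re

# The four noise symbols described in the transcription conventions.
NOISE_MARKERS = {
    "[fil]": "filled pause",
    "[spk]": "speaker noise",
    "[sta]": "stationary noise",
    "[int]": "intermittent noise",
}

_MARKER_RE = re.compile(r"\[(?:fil|spk|sta|int)\]")


def split_transcription(text):
    """Return (words, markers) for one orthographic transcription line."""
    markers = _MARKER_RE.findall(text)
    words = _MARKER_RE.sub(" ", text).split()
    return words, markers


if __name__ == "__main__":
    # Hypothetical transcription, not taken from the corpus.
    words, markers = split_transcription("[sta] uno dos [fil] tres")
    print(words)
    print([NOISE_MARKERS[m] for m in markers])
```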
The Spanish database was transcribed using the software tool UPCRevBD.v1, developed at UPC. One percent of the transcriptions were supervised by UPC. The database has been supervised and validated by SPEX.
The lexicon is included in the documentation. The lexicon file is an alphabetically ordered list of distinct lexical items (essentially words in our case) which occur in the corpus with the corresponding pronunciation information. Each distinct word has a separate entry. As the lexicon is derived from the corpus it uses the same alphabetic encoding for special and accented characters as used in the transcriptions (ISO-8859). The file includes a frequency count for each entry in the lexicon.
The pronunciation lexicon was produced after the transcription phase; it contains, alphabetically sorted, all words found in the "LBO:" transcription (one entry per word), their number of occurrences and the list of their phonemic representations. The words appear in the lexicon exactly as they appear in the transcription. The lexicon is case insensitive.
All the component words have been identified and alphabetically sorted; all fragments, mispronunciations and non-speech events have been removed, and only one occurrence of each word has been retained.
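The word-list step described above (collect the words, drop non-speech events, keep one entry per word with its count, sort alphabetically) can be sketched as follows. This is an illustration under stated assumptions, not the tool used at UPC: the function name and the sample lines are hypothetical, non-speech events are assumed to be bracketed symbols such as [fil], fragment and mispronunciation marks are not handled because their notation is not given here, and case insensitivity is approximated by lower-casing.

```python
import re
from collections import Counter

# Assumption: non-speech events are written as bracketed lowercase symbols.
_EVENT_RE = re.compile(r"\[[a-z]+\]")


def build_word_counts(transcriptions):
    """Return an alphabetically sorted list of (word, count) pairs."""
    counts = Counter()
    for line in transcriptions:
        words = _EVENT_RE.sub(" ", line).lower().split()
        counts.update(words)
    # Plain code-point ordering; a locale-aware sort would be needed
    # to order accented characters the way a Spanish dictionary does.
    return sorted(counts.items())


if __name__ == "__main__":
    # Hypothetical transcription lines, not taken from the corpus.
    sample = ["Madrid [spk] Barcelona", "madrid [fil]"]
    for word, count in build_word_counts(sample):
        print(word, count)
```

Each resulting word would then be passed to a grapheme-to-phoneme step to obtain its pronunciation.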
A software tool developed at UPC (SAGA: Spanish Automatic Graphemes to Allophones Transcriber) has been used to translate the transcribed words to phonemic strings by using the SAMPA phonemic notation. The lexicon was transcribed automatically. Proper names and company names were checked manually.
The set of environments where the database has been collected is: HOME/OFFICE, TELEPHONE_BOOTH, VEHICLE, PUBLIC PLACE.
A sample speech file and its accompanying ASCII SAM label file are available for download.
This database is commercially available.
SpeechDat II FDB-4000 speakers. With validation report from SPEX
SpeechDat II FDB-1000 speakers. Subset with validation report from SPEX