Spanish Speech Recognition Resources

This database comprises recordings from 306 speakers made in 600 different sessions. Speech signals were recorded in a car and simultaneously transmitted over GSM and recorded on a fixed platform connected to an ISDN line.

 

The SpeechDat Car Spanish Database was recorded within the scope of the SpeechDat Car project (LE4-8334) which was sponsored by the European Commission and the Spanish Government.

 

Collection was performed at the Department of Signal Theory and Communications of the Technical University of Catalonia (UPC) (Spain) with the collaboration of SEAT and Volkswagen. The owner of the database is UPC.

 

 

Definition of the database content

 

The following table shows the contents and corpus codes of the SpeechDat Car Spanish Database. All items are read, unless marked as spontaneous.

 

[Table of database contents and corpus codes not reproduced here.]

Speakers

 

Spain has a population of 38 million people. The official language is Spanish (Castilian), and some regions also have other co-official languages, such as Catalan, Galician, and Basque. Because of the limited number of speakers to be recorded, the number of regions is small. The dialectal regions were defined taking into account phonetic differences among regions, and four groups were established:

 
 

 

Region        Description
NORTH WEST    Galicia, Asturias
CENTER        Aragón, Cantabria, Castilla-La Mancha, Castilla y León, La Rioja, Madrid, Extremadura (North), País Vasco, Navarra
SOUTH         Andalucía, Canarias, Extremadura (South), Murcia
EAST          Cataluña, Valencia, Baleares


The distribution of recorded sessions as a function of the accent region of the speakers is shown in the next table.

 

 

Number   Accent region   Number of speakers   Number of sessions   Sessions (%)
1        NORTHWEST       53                   106                  17.6
2        CENTER          78                   154                  26.0
3        SOUTH           54                   105                  17.1
4        EAST            121                  235                  39.3
TOTAL                    306                  600                  100


The total number of different speakers is 306: 149 female and 157 male. The next table shows the number of sessions spoken by female and male speakers, broken down by age group.

 

 

              Male speakers         Female speakers       Percentage of total
Age group     Number   Sessions     Number   Sessions     Speakers   Sessions
18-30         84       165          76       150          52.0       52.1
31-45         41       80           39       75           26.1       26.1
46-60         30       59           35       69           21.6       21.5
over 60       1        2            0        0            0.3        0.3
TOTAL         156      306          150      294          100        100

Recording Platforms

 

Two types of recordings compose the database. First, wideband recordings (60-7000 Hz) were made for systems installed and operating in the car itself; second, narrow-band recordings (300-3400 Hz) were made for systems that operate centrally outside the car and receive their spoken input from the driver over the cellular telephone network. Two recording platforms were used:

  • A ‘mobile’ recording platform (PltM) installed inside the car, recording multi-channel speech utterances in a high bandwidth mode (16 kHz sample frequency).
  • A ‘fixed’ recording platform (PltF) located at the far-end fixed side of the GSM connection, simultaneously recording the speech utterances coming from the car (8 kHz sample frequency, A-law encoding).

Multi-channel recordings were performed simultaneously in the car and through the GSM network.
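The fixed-platform signals are stored as 8 kHz A-law samples, i.e. the companded format standardized as ITU-T G.711. As a sketch of how such samples could be expanded to 16-bit linear PCM (the database's own file headers and naming conventions are not assumed here):

```python
def alaw2linear(a_val: int) -> int:
    """Expand one 8-bit A-law sample (ITU-T G.711) to 16-bit linear PCM."""
    a_val ^= 0x55                      # undo the even-bit inversion
    t = (a_val & 0x0F) << 4            # 4-bit mantissa
    seg = (a_val & 0x70) >> 4          # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t = (t + 0x108) << (seg - 1)
    return t if a_val & 0x80 else -t   # in A-law the sign bit 1 means positive

def decode_alaw(raw: bytes) -> list[int]:
    """Decode a buffer of A-law bytes to a list of 16-bit PCM samples."""
    return [alaw2linear(b) for b in raw]

# 0xD5 is the A-law 'silence' byte, decoding to +8
samples = decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]))
# samples == [8, -8, 32256, -32256]
```

The expansion follows the standard G.711 reference implementation; the full-scale output is ±32256 on the 16-bit scale.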

 

Recording conditions

 

 

Seven environment conditions are defined:

  • car stopped with the motor running; CEQ: no restrictions
  • car in town traffic; CEQ: everything set to off or closed
  • car in town traffic; CEQ: noisy conditions
  • car moving at low speed on a rough road; CEQ: everything set to off or closed
  • car moving at low speed on a rough road; CEQ: noisy conditions
  • car moving at high speed on a good road; CEQ: no restrictions
  • car moving at high speed on a good road, with the audio equipment on; no further restrictions

In addition, some information was collected during the recordings:

  • Weather conditions: rain, sunshine, wind, etc.
  • Accessories used during recording: windscreen wipers, ventilation, fan, radio, etc.
  • Fan setting: on/off

Transcription

 

The transcription included in this database is an orthographic, lexical transcription with a few details that represent audible acoustic events (speech and non-speech) present in the corresponding waveform files. The extra marks contained in the transcription aid in interpreting the text form of the utterance. Transcriptions were made in two passes: a first pass in which the words were transcribed, and a second pass in which the additional details were added. Transcriptions are CASE INSENSITIVE.

 

Non-speech acoustic events have been arranged into 5 categories and transcribed. Events are transcribed only if they are clearly distinguishable; very low-level, non-intrusive events are ignored. Each event is transcribed at its place of occurrence, using the defined symbols in square brackets. For noise events that span one or more words, the transcription marks the beginning of the noise, just before the first word it affects.

 

The first two categories of acoustic events originate from the speaker, and the other three categories originate from another source. The 5 categories are:

 

[fil]: Filled pause. These sounds can be modeled well by a filled-pause model in speech recognisers. Examples of filled pauses: uh, um, er, ah, mm.

 

[spk]: Speaker noise. All kinds of sounds and noises made by the calling speaker that are not part of the prompted text, e.g. lip smack, cough, grunt, throat clear, tongue click, loud breath, laugh, loud sigh.

 

[sta]: Stationary noise. This category contains background noise that is not intermittent and has a more or less stable amplitude spectrum. Examples: voice babble (cocktail-party noise), sirens, wind, rain, cobblestones.

 

[int]: Intermittent noise. This category contains noises of an intermittent nature. These noises typically occur only once (like a door slam), or have pauses between them (like phone ringing), or change their color over time (like music). Examples: music, background speech, baby crying, phone ringing, door slam, door bell, paper rustle, cross talk, ticks by the direction indicator.

 

[dit]: DTMF and prompt tone. This is in fact a special case of [int], but since this sound can be expected in nearly every speech file, a special symbol was defined.

 

Only signals from microphone 0 have been transcribed. All the signals contain the prompt beep.
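Given the bracketed event symbols above, a transcription line can be separated into its word content and its noise events. A minimal sketch (the exact transcription file layout is not specified here, and the sample utterance is invented for illustration):

```python
import re

# The five event symbols defined for this database
EVENT_RE = re.compile(r"\[(fil|spk|sta|int|dit)\]")

def parse_transcription(line: str) -> tuple[str, list[str]]:
    """Return (clean word string, event symbols in order of occurrence)."""
    events = EVENT_RE.findall(line)
    # Remove the markers and collapse the remaining whitespace
    words = " ".join(EVENT_RE.sub(" ", line).split())
    return words, events

# Hypothetical transcription line:
text, events = parse_transcription("[fil] eh [sta] llame a casa [spk]")
# text == "eh llame a casa", events == ["fil", "sta", "spk"]
```

Since markers precede the first word a noise affects, the event list preserves the approximate position of each noise relative to the words.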

 

The Lexicon

 

The database includes a lexicon. The lexicon file is an alphabetically ordered list of the distinct lexical items (essentially words, in our case) that occur in the corpus, together with the corresponding pronunciation information. Each distinct word has a separate entry. As the lexicon is derived from the corpus, it uses the same character encoding for special and accented characters as the transcriptions (ISO-8859). A frequency count is included for each entry, e.g. to help flag rare words whose transcriptions are perhaps less important or reliable.

 

The pronunciation lexicon was produced after the transcription phase; it contains, alphabetically sorted, all words found in the transcriptions (one entry per word), their number of occurrences, and the list of their phonemic representations. The words appear in the lexicon exactly as they appear in the transcriptions. All component words were identified and alphabetically sorted; all fragments, mispronunciations, and non-speech events were removed, and only one occurrence of each word was retained.
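A reader of such a lexicon might look like the sketch below. The column layout is an assumption (tab-separated word, frequency count, and one or more SAMPA pronunciations), since the exact file format is not reproduced in this description:

```python
def parse_lexicon_line(line: str) -> tuple[str, int, list[str]]:
    """Parse one lexicon entry.

    Assumed (not documented here) layout: tab-separated fields holding
    the word, its frequency count, and one or more SAMPA pronunciations.
    """
    fields = line.rstrip("\n").split("\t")
    word, count = fields[0], int(fields[1])
    pronunciations = fields[2:]
    return word, count, pronunciations

# Hypothetical entry for illustration:
word, count, prons = parse_lexicon_line("casa\t42\tk a s a")
# word == "casa", count == 42, prons == ["k a s a"]
```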

 

A software tool developed at UPC (SAGA: Spanish Automatic Graphemes to Allophones Transcriber) has been used to translate the transcribed words to phonemic strings by using the SAMPA phonemic notation. The complete lexicon was manually supervised.
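SAGA's actual rule set is not reproduced in this description. Purely to illustrate the idea of rule-based grapheme-to-phoneme conversion into SAMPA, a toy Castilian sketch (ignoring stress, diphthongs, allophonic variation, and many orthographic rules) might look like:

```python
# Digraphs must be matched before single letters
DIGRAPHS = {"ch": "tS", "ll": "L", "rr": "rr", "qu": "k"}

# Toy single-letter mapping to SAMPA (Castilian); 'h' is silent
SINGLE = {"a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
          "b": "b", "v": "b", "d": "d", "f": "f", "g": "g",
          "j": "x", "k": "k", "l": "l", "m": "m", "n": "n",
          "ñ": "J", "p": "p", "r": "r", "s": "s", "t": "t",
          "z": "T", "y": "j", "w": "w", "x": "ks", "h": ""}

def to_sampa(word: str) -> str:
    """Toy grapheme-to-SAMPA conversion for a lowercase Spanish word."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:          # ch, ll, rr, qu
            out.append(DIGRAPHS[word[i:i + 2]])
            i += 2
            continue
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "c":                          # 'c' -> /T/ before e, i; else /k/
            out.append("T" if nxt in "ei" else "k")
        elif ch == "g" and nxt in "ei":        # 'g' -> /x/ before e, i
            out.append("x")
        else:
            out.append(SINGLE.get(ch, ch))
        i += 1
    return " ".join(p for p in out if p)       # drop silent 'h'

# to_sampa("casa")  == "k a s a"
# to_sampa("chico") == "tS i k o"
# to_sampa("cero")  == "T e r o"
# to_sampa("año")   == "a J o"
```

A production tool such as SAGA additionally handles stress assignment, syllabification, and pronunciation variants, which is why the resulting lexicon still required manual supervision.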

 
 

Availability

 

This database is commercially available.

 
