Spanish Speech Recognition Resources

The EUROM 1 Spanish Database was recorded in the framework of the European ESPRIT Project 6819 SAM-A. The database was designed for Automatic Speech Recognition assessment purposes. It contains recordings from sixty speakers made in a true anechoic room. Speakers were selected to obtain good dialectal coverage. The database contains numbers, passages, sentences, isolated CVCV words and CVCV words in carrier phrases. The sampling frequency is 20 kHz.


Database contents

The Spanish EUROM.1 database consists of three corpora.
1- Many Talker Corpus (60 speakers):
  • 100 numbers, each spoken once
  • 5 sentences, each spoken once
  • 3 passages.
2- Few Talker Corpus (10 speakers):
  • 100 numbers, each spoken 5 times
  • 25 sentences
  • 15 passages
  • CVCV words, each spoken 5 times.
3- Very Few Talker Corpus (2 speakers):
  • 82 CVCV words embedded in 5 different carrier phrases, spoken once
  • 10 carrier phrase words, each spoken five times
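The composition above can be tallied with a short sketch. It is an illustration only, not part of the database tools: the class and field names are hypothetical, the figure of 82 CVCV words is taken from the prompting-text description, and sentences and passages in the Few Talker Corpus are assumed to be spoken once each, which the list does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    count: int        # number of distinct items per speaker
    repetitions: int  # times each item is spoken


@dataclass(frozen=True)
class Corpus:
    name: str
    speakers: int
    items: tuple

    def utterances_per_speaker(self):
        return sum(i.count * i.repetitions for i in self.items)

    def total_utterances(self):
        return self.speakers * self.utterances_per_speaker()


CVCV_WORDS = 82  # word count given in the prompting-text description

CORPORA = (
    Corpus("Many Talker", 60, (
        Item("numbers", 100, 1),
        Item("sentences", 5, 1),
        Item("passages", 3, 1),
    )),
    Corpus("Few Talker", 10, (
        Item("numbers", 100, 5),
        Item("sentences", 25, 1),   # assumption: spoken once each
        Item("passages", 15, 1),    # assumption: spoken once each
        Item("CVCV words", CVCV_WORDS, 5),
    )),
    Corpus("Very Few Talker", 2, (
        # each CVCV word in each of 5 carrier phrases, spoken once
        Item("CVCV words in carrier phrases", CVCV_WORDS * 5, 1),
        Item("carrier phrase words", 10, 5),
    )),
)

for corpus in CORPORA:
    print(corpus.name, corpus.utterances_per_speaker(), corpus.total_utterances())
```

The counts are item utterances, not recording takes; a take groups several items into one block.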

Prompting texts

The EUROM.1 prompting texts consist of three parts: Passages and Sentences, Numbers, and CVCV words.
  • Passages and Sentences: There are 40 passages and 10 blocks of 5 sentences. A passage consists of 5 task-related sentences, whereas the sentences are designed to compensate for the uneven diphone distribution in the phonotypical transcriptions of the passages. The database includes the list of passages and sentences and their phonotypical transcription. One recording take consists of either a passage or a block of five sentences. Each passage was displayed as a single block, and speakers tried to produce the block with natural intonation. The five sentences in each block were displayed individually.
  • Numbers: A list of 100 numbers was divided into five blocks of 20 numbers. Each block was recorded as a single take.
  • CVCV: A total of 82 CVCV words were recorded. The CVCV words were divided into 6 blocks. One take consists of one block, with the CVCV words displayed one at a time. The isolated CVCV words were spoken by all the talkers in the Few Talker group. In addition, the CVCV words were recorded embedded in carrier phrases: each CVCV word was recorded in context in five carrier phrases. There were 10 carrier words, which were also recorded as isolated words. The CVCV words embedded in carrier phrases and the context words were spoken by the Very Few Talker group. A list of CVCV words, CVCV words in context, context words and their phonotypical transcription can be found in the database.

Speaker selection 

The database contains sixty subjects, male and female, selected from more than 100 candidates to ensure wide dialectal variation. The database includes a talker description file, SPEAKERS.DBF. The age versus sex distribution is as follows:
20-30 11 11
30-40 11 11
40-50 7 4
50+ 3 2

Recording environment

The recordings were carried out in a true anechoic chamber of the Escuela Universitaria de Ingeniería Técnica de Telecomunicación "La Salle" of the Ramon Llull University (RLU). Colleagues from RLU provided technical support in the calibration tests of the recording room. Prompting texts were displayed to talkers on a monitor placed in the anechoic room. This was the only noise source in the room.

Recording control 

Two operators, an engineer and a phonetician, supervised the recordings from an adjoining room. At the beginning of the session, speakers were asked to read as many passages as necessary to calibrate the system, because a single passage proved insufficient. Gain was adjusted so that the normal speaking level reached a reference point 6-12 dB below peak. During recordings, the phonetician listened for reading mistakes and asked for re-takes. The talkers were urged to ask for a break whenever they needed one. Microphone distance: 30 cm from the lips, 15 degrees off-axis, 90 degrees incident. No head positioner was used. 5% of the material was checked immediately after the session. A 1 kHz calibration recording was made at the start of each session. Square wave and silence calibration was made every week.
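The gain criterion can be illustrated with a minimal sketch. It is not the procedure used during the sessions: it assumes that "level" means the largest absolute sample of a calibration passage, that samples are normalised so full scale is 1.0, and the function names are hypothetical.

```python
import math


def headroom_db(samples, full_scale=1.0):
    """dB between the largest absolute sample and full scale (positive = below peak)."""
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return float("inf")  # pure silence: infinitely far below peak
    return 20.0 * math.log10(full_scale / peak)


def gain_in_target(samples, full_scale=1.0, low_db=6.0, high_db=12.0):
    """True if the measured level sits 6-12 dB below peak (assumed interpretation)."""
    return low_db <= headroom_db(samples, full_scale) <= high_db


# Hypothetical calibration passages with different peak levels.
print(gain_in_target([0.0, 0.4, -0.3]))  # peak 0.4: about 8 dB below full scale
print(gain_in_target([0.0, 0.9, -0.3]))  # peak 0.9: too close to full scale
```

An operator would raise or lower the gain and repeat the reading until the check passes, which matches the need to read more than one passage.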

Recording Mode and Prompting Style 

The recordings contain no speaking errors.
  • Recording Mode 1: A take was recorded in one complete segment. The sampling and transfer process was started at the beginning of the take, the entire acoustic signal was recorded, and the process was stopped only at the end of the take.
  • Prompting Style 1: ABORT TAKE and re-record. The subject was told that, if he/she made a speaking error, the prompting and recording systems would be stopped by an escape mechanism. The interruption was indicated to the subject, and the prompting system was restarted to re-record that take.
  • Mixed Timing Strategy: The timing of the prompt was controlled by a logical combination of a predetermined interval and the endpoint of an utterance. The display of each new prompt was triggered by whichever was later of the predetermined interval or the endpoint.
  • Manual Timing: Used only for passages.
The relevant parameters selected for Mixed Timing were as follows:
Extinction level: -40 dB (level the signal must cross down to be considered as silence)
End Signal Silence: 500 ms (duration of silence that determines the end of a recording)
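A rough sketch of how these two parameters could drive the Mixed Timing rule is given here. It is not the prompting software used for the recordings: the 10 ms frame length, the use of RMS level relative to a full scale of 1.0, and all function names are assumptions. The 20 kHz sampling rate, the -40 dB extinction level and the 500 ms silence duration come from the description.

```python
import math

SAMPLE_RATE_HZ = 20_000        # sampling frequency stated for the database
EXTINCTION_LEVEL_DB = -40.0    # below this level a frame counts as silence
END_SILENCE_MS = 500           # silence duration that ends the utterance
FRAME_MS = 10                  # assumption: analysis frame length


def frame_level_db(frame):
    """RMS level of a frame in dB relative to full scale 1.0 (assumed reference)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def find_endpoint_ms(samples):
    """Time (ms) at which the end of the utterance is confirmed, or None."""
    frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000
    frames_needed = END_SILENCE_MS // FRAME_MS
    silent_run = 0
    speech_seen = False
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        level = frame_level_db(samples[start:start + frame_len])
        if level >= EXTINCTION_LEVEL_DB:
            speech_seen = True
            silent_run = 0
        elif speech_seen:
            silent_run += 1
            if silent_run >= frames_needed:
                # Confirmed once 500 ms of silence have followed the speech.
                return (start // frame_len + 1) * FRAME_MS
    return None


def next_prompt_time_ms(interval_ms, endpoint_ms):
    """Mixed Timing: show the next prompt at whichever comes later."""
    return max(interval_ms, endpoint_ms)


# Synthetic take: 200 ms of signal at half scale, then 600 ms of silence.
take = [0.5] * 4000 + [0.0] * 12000
endpoint = find_endpoint_ms(take)
print(endpoint, next_prompt_time_ms(1000, endpoint))
```

With a short utterance the predetermined interval governs the next prompt; with a long one the detected endpoint does.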


This database is commercially available.


