Catalan Speech Recognition Resources

SpeechDat Catalan FDB - 1005 speakers

The SpeechDat Catalan Database for the Fixed Telephone Network was collected at the TALP Research Center of the Universitat Politècnica de Catalunya (UPC). The production of this database was partially funded by the Centre de Referència en Enginyeria Lingüística (CREL). The Laboratori de fonètica de la Universitat Autònoma de Barcelona (UAB) collaborated on the phonetic work.

This database comprises telephone recordings from 1005 speakers recorded directly over the fixed PSTN using an E-1 interface. 

 

This database was designed following the specifications of the SpeechDat II project. Equivalent corpora have been collected for other European languages (e.g. Spanish).

 

The corpus is designed to support the creation of voice-driven teleservices in Catalan. Each caller spoke 40 items, comprising isolated and connected digits, natural numbers, money amounts, spellings, time and date phrases, confirmations/rejections, forenames and surnames, city names, company names, common application words, application words in phrases and phonetically rich sentences. Most items are read; some are spontaneously spoken. The recordings come with extensive and standardised documentation. All speech is carefully transcribed at the orthographic level; in addition, a number of clearly audible non-speech events are included in the transcription. The age and regional background of the speakers are also provided. A pronunciation dictionary is included, containing all words that occur in the corpus, each with a corresponding SAMPA broad-class phonemic transcription. The data files are formatted according to the ESPRIT Project SAM standards.

 

Database contents definition 

Each call within the database consists of a total of 40 items.

 

Item  Count  Corpus contents                                             Group
1-3   3      application words
1     1      sequence of 10 isolated digits
1     1      sheet number (6 digits)                                     4 connected digits
2     1      telephone number (9-11 digits)
3     1      credit card number (14-16 digits)
4     1      PIN code (6 digits) (set of 150)
1     1      spontaneous date, e.g. birthday                             3 dates
2     1      prompted date, word style
3     1      relative and general date expression
1     1      word spotting phrase using an application word (embedded)
1     1      isolated digit
1     1      spelling of surname (same as O1)                            3 spelled words
2     1      spelling of direct. city name (O3)
3     1      real/artificial for coverage (letter sequences)
1     1      currency money amount
1     1      natural number
1     1      surname (set of 600)
2     1      city of birth / growing up (spontaneous)
3     1      most frequent cities (set of 500)
5     1      most frequent company/agency (set of 1190)
7     1      forename surname (set of 150)
1     1      predominantly yes question                                  2 questions, including fuzzy yes/no
2     1      predominantly no question
1-9   9      phonetically rich sentences
1     1      time of day (spontaneous)                                   2 time phrases
2     1      time phrase (word style)
1-4   4      phonetically rich words


Speakers 

Catalan is mainly spoken in Catalonia, the Valencia region and the Balearic Islands. This whole Catalan-speaking area is bilingual, since Spanish is also spoken there. Catalan is also present in a few less populated areas: Rosselló and Vallespir (south-east of France), the part of Aragon bordering Catalonia, and Alguer in Sardinia.

The main dialectal division is between East and West; it runs through Catalonia and also separates the Valencian and Balearic dialects. A secondary division, between North and South, separates the East and West dialects of Catalonia from the Balearic and Valencian ones, respectively (see Joan Veny, Els parlars catalans, Edit. Moll, Mallorca, 1993).

Since all 1005 speakers in this database were recruited in Catalonia, only two dialects were considered: East and West. The Eastern dialect of Catalonia is properly called Central. Its geographical region includes the demographically and economically dominant Barcelona metropolitan area. Speakers classified as West belong, strictly speaking, to the North-Western dialect (the Western dialect spoken in Catalonia, in Andorra and along the border with Aragon); they were recruited in the vicinity of Lleida. The two sets of speakers are thus representative of the two main dialects spoken in Catalonia.

The Eastern dialect is spoken in the area that holds most of Catalonia’s population (around 85%). This proportion was used as the target when recruiting speakers.

The distribution of speakers over accent regions is shown in the next table.

 

Number  Accent region  Male  Female  Total  % of speakers
1       EAST           384   444     828    82.4
2       WEST           90    87      177    17.6


The next table shows the distribution of speakers over age groups and sexes. The documentation file SESSION.TBL contains the accent region, age and sex of each speaker.

Age group  Male  Female  Total  % of total
Under 16   7     6       13     1.29
16-30      210   263     473    47.06
31-45      150   136     286    28.46
46-60      83    109     192    19.10
Over 60    24    17      41     4.08
Total      474   531     1005

 


The minimum percentages of speakers that SpeechDat specifies per age group and per sex are met.

A maximum of one call per speaker was allowed.
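
The percentages in the age-group table can be recomputed from the male and female counts. The following is a minimal sketch of that arithmetic; the names AGE_GROUPS and age_percentages are illustrative and are not part of the database documentation.

```python
# Speaker counts per age group, copied from the table: (male, female).
AGE_GROUPS = {
    "Under 16": (7, 6),
    "16-30": (210, 263),
    "31-45": (150, 136),
    "46-60": (83, 109),
    "Over 60": (24, 17),
}


def age_percentages(groups):
    """Return each group's share of all speakers, in percent, to 2 decimals."""
    total = sum(male + female for male, female in groups.values())
    return {
        name: round(100 * (male + female) / total, 2)
        for name, (male, female) in groups.items()
    }


def sex_totals(groups):
    """Return (total male, total female) over all age groups."""
    males = sum(male for male, _ in groups.values())
    females = sum(female for _, female in groups.values())
    return males, females
```

The same function applied to the accent-region counts (828 and 177 speakers) gives the 82.4% and 17.6% shares reported for EAST and WEST.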


Speech file format 

Speech files are stored as sequences of 8-bit, 8 kHz, A-law uncompressed speech samples (CCITT G.711 recommendation). Each prompted utterance is stored in a separate file. Each speech file has an accompanying ASCII SAM label file.
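
To work with these files in tools that expect linear PCM, the A-law codes have to be expanded first. The sketch below shows the standard G.711 A-law expansion in plain Python. It assumes that a speech file is a headerless stream of one byte per sample, which the description above suggests but does not state explicitly; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate stated for the speech files


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed value in 16-bit range."""
    code ^= 0x55                      # undo the even-bit inversion of the encoder
    sign = code & 0x80                # bit 7 set means a positive sample
    exponent = (code & 0x70) >> 4     # 3-bit segment number
    mantissa = code & 0x0F            # 4-bit step inside the segment
    magnitude = (mantissa << 4) + 8   # segment 0 is linear; +8 centres the step
    if exponent >= 1:
        # Higher segments add the implicit leading bit, then scale by segment.
        magnitude = (magnitude + 0x100) << (exponent - 1)
    return magnitude if sign else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte stream into a list of linear PCM samples."""
    return [alaw_to_linear(byte) for byte in data]


def duration_seconds(data: bytes) -> float:
    """Duration of a raw A-law stream: one byte per sample at 8 kHz."""
    return len(data) / SAMPLE_RATE_HZ
```

A file read in binary mode can be passed directly to decode_alaw; the resulting samples can then be written out, for example, with the standard wave module as 16-bit mono at 8000 Hz.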


Recording site and platform 

 

Recordings took place at UPC. The main characteristics of the recording platform are:

 

 
Interface: ISDN basic access (BRI)
Board: AVM-ISDN-A1
Computer: Pentium PC at 120 MHz, 32 MB RAM, 4 GB SCSI hard disk, PCI network card
Operating system: Windows NT
Programming interface: COMMON-ISDN-API Version 2.0 (CAPI 2.0)
Software: application software written in C (UPC ADA program)
Lines:  


Transcription 

The transcription included in this database is an orthographic, lexical transcription with a few details that represent audible acoustic events (speech and non-speech) present in the corresponding waveform files. The extra marks in the transcription aid in interpreting the text form of the utterance. Transcriptions were made in two passes: a first pass in which the words were transcribed, and a second pass in which the additional details were added.

The character set used for the transcriptions is ISO-8859.

Foreign words pronounced in a foreign language are transcribed in that language if they were pronounced correctly. This situation may arise with the names of foreign companies or agencies.

Pronunciation variations are not indicated by different spellings in the transcription, but by different phonemic transcriptions in the lexicon. The lexicon, LEXICON.TBL, includes the Catalan words. Dialectal variations are not included; only the dominant Eastern dialect is considered.

Number sequences (times, dates, money amounts, etc.) are spelled out to reflect what was said ("seven thirty"; "august twenty first"; "seven forty seven"; "four hundred and ten dollars").

Letter sequences occur in spelled words, ZIP-codes, acronyms and abbreviations.  

No punctuation is provided in the transcription other than the symbols used for special transcription purposes. However, the label files retain all punctuation provided to the speaker in the prompting text, including mistakes if these occurred.

Non-speech acoustic events have been arranged into 4 categories and transcribed. Events are transcribed only if they are clearly distinguishable; very low-level, non-intrusive events are ignored. Each event is transcribed at its place of occurrence, using the defined symbols in square brackets. For noise events that span one or more words, the transcription indicates the beginning of the noise, just before the first word it affects.

The first two categories of acoustic events originate from the speaker, and the other two originate from another source. Sounds originating from the speaker usually do not overlap with the target speech, whereas sounds from other sources can occur simultaneously with the speech.

The 4 categories are: 

[fil]: Filled pause.
These sounds can be modelled well by a filled-pause model in speech recognisers. Examples of filled pauses: uh, um, er, ah, mm.
[spk]: Speaker noise.
All kinds of sounds and noises made by the calling speaker that are not part of the prompted text, e.g. lip smack, cough, grunt, throat clear, tongue click, loud breath, laugh, loud sigh.
[sta]: Stationary noise.
This category contains background noise that is not intermittent and has a more or less stable amplitude spectrum. Examples: car noise, road noise, channel noise, GSM noise, voice babble (cocktail-party noise), public place background noise, street noise.
[int]: Intermittent noise.
This category contains noises of an intermittent nature. These noises typically occur only once (like a door slam), or have pauses between them (like phone ringing), or change their colour over time (like music). Examples: music, background speech, baby crying, phone ringing, door slam, doorbell, paper rustle, cross talk.
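
When the transcriptions are used to train or score a recogniser, the four bracketed event symbols usually have to be separated from the spoken words. The following minimal sketch does this with a regular expression. It assumes only what is described above, namely that events appear inline as [fil], [spk], [sta] or [int]; the function name and the example utterance are hypothetical.

```python
import re
from collections import Counter

# The four event symbols defined for the transcriptions.
EVENT_TAGS = ("fil", "spk", "sta", "int")
EVENT_RE = re.compile(r"\[(" + "|".join(EVENT_TAGS) + r")\]")


def split_events(transcription):
    """Return (text without event symbols, Counter of event symbols)."""
    counts = Counter(EVENT_RE.findall(transcription))
    # Replace each symbol with a space, then collapse repeated whitespace.
    clean = " ".join(EVENT_RE.sub(" ", transcription).split())
    return clean, counts


# Hypothetical transcription line, not taken from the corpus.
text, events = split_events("[sta] bon dia [fil] Barcelona")
```

Any other bracketed or special symbols are left untouched by this sketch, so they remain visible for separate handling.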

The Catalan Database was transcribed using the software tool UPCRevBD.v2, developed at UPC.


The lexicon 

The lexicon is included in the documentation. 

The lexicon file is an alphabetically ordered list of distinct lexical items (essentially words in our case) which occur in the corpus with the corresponding pronunciation information. Each distinct word has a separate entry. As the lexicon is derived from the corpus it uses the same alphabetic encoding for special and accented characters as used in the transcriptions (ISO-8859).  

The pronunciation lexicon was produced after the transcription phase; it contains, alphabetically sorted, all words found in the "LBO:" transcription (one occurrence for each word), and the list of their phonemic representations. The words appear in the lexicon exactly as they appear in the transcription. The lexicon is case insensitive. 

All the component words have been identified and alphabetically sorted; all fragments, mispronunciations and non-speech events have been removed, and only one occurrence of each word has been kept.

A manual revision was made by the Laboratori de fonètica de la Universitat Autònoma de Barcelona.
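
The word-list step of the lexicon construction described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used at UPC: it removes the bracketed non-speech events, folds case (the lexicon is case insensitive), keeps one occurrence of each word and sorts the result. It does not handle fragments or mispronunciations, because their markers are not specified here, and Python's default sort orders accented characters by code point rather than by Catalan alphabetical rules. The function name and example lines are hypothetical.

```python
import re

# Matches any three-letter bracketed event symbol such as [fil] or [spk].
EVENT_RE = re.compile(r"\[[a-z]{3}\]")


def build_word_list(transcriptions):
    """Return the sorted list of distinct, lower-cased words in the input."""
    words = set()
    for line in transcriptions:
        # Drop non-speech events, then split the remaining text on whitespace.
        for token in EVENT_RE.sub(" ", line).split():
            words.add(token.lower())
    return sorted(words)


# Hypothetical transcription lines, not taken from the corpus.
word_list = build_word_list(["[spk] Barcelona set", "set [fil] vuit"])
```

Each entry in such a list would then be paired with its phonemic transcriptions to form the lexicon entries.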


Environments 

Calling environment: the possible values are Home/Office and Mobile. The next table shows the number and percentage of received calls for each calling environment.

Environment  Received calls  Percentage
Home/Office  953             94.83
Mobile       52              5.17

 

Speech sample 

You can download this speech file.

Label file sample 

Accompanying ASCII SAM label file to the speech file.

Availability 

 
 

The database is commercially available in 4 ISO 9660 CD-ROM volumes.

 


Additional information