The EPPS Transcriptions database was created within the scope of the TC-STAR project (FP6-506738), which was sponsored by the European Commission. The database comprises recordings of members of the European Parliament speaking in the parliamentary plenary sessions (EPPS) as well as recordings of interpreters. Recordings of the Spanish Parliament (PARL) are also included to achieve a total of 100 hours of transcribed speech. Transcription was performed by Applied Technologies on Language and Speech, S.L. (ATLAS), from Spain. The owner of the transcriptions is Universitat Politecnica de Catalunya, from Spain.
The EPPS transcriptions consist of 61:53 hours of speech of members of the European Parliament speaking in the parliamentary plenary sessions, as well as recordings of interpreters. The PARL transcriptions consist of 38:24 hours of speech of members of the Spanish Parliament speaking in the Spanish Parliament and Spanish Congress during plenary sessions and commissions. The total amount of audio recordings, including non-transcribed sections, is 143:10 hours.
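As a quick consistency check on the figures above, the following minimal sketch adds the two transcribed durations and compares them with the total recorded audio. It assumes the durations are written as hours:minutes; the helper names are illustrative and not part of the database.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'H:MM' duration string to a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes: int) -> str:
    """Convert a number of minutes back to an 'H:MM' string."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


# Figures quoted in the description (assumed to be hours:minutes).
epps = to_minutes("61:53")
parl = to_minutes("38:24")
recorded = to_minutes("143:10")

transcribed = epps + parl
print("Transcribed speech:", to_hhmm(transcribed))
print("Share of recorded audio:", f"{transcribed / recorded:.1%}")
```

Under that assumption, the two parts add up to 100:17 hours, which matches the stated target of about 100 transcribed hours, and roughly 70% of the recorded audio is transcribed.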
Each speech file (extension .WAV) has an accompanying file with the transcription in XML format (extension .TRS).
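The pairing of speech and transcription files can be checked with a short script. The sketch below assumes that a .WAV file and its .TRS file share the same base name, which the description does not state explicitly; the function names are hypothetical. It only tests that each transcription parses as XML and makes no assumption about the elements inside it.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def pair_recordings(root_dir):
    """Return (wav, trs) pairs and the WAV files without a transcription.

    Assumption: matching files share the same base name.
    """
    root = Path(root_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    # Index transcriptions by base name, ignoring extension case.
    trs_by_stem = {p.stem: p for p in files if p.suffix.lower() == ".trs"}
    pairs, missing = [], []
    for wav in sorted(p for p in files if p.suffix.lower() == ".wav"):
        trs = trs_by_stem.get(wav.stem)
        if trs is None:
            missing.append(wav)
        else:
            pairs.append((wav, trs))
    return pairs, missing


def is_well_formed(trs_path):
    """Return True if the transcription file parses as XML."""
    try:
        ET.parse(trs_path)
        return True
    except ET.ParseError:
        return False
```

A typical use would be to call `pair_recordings` on the database directory, report the entries in `missing`, and run `is_well_formed` on each paired transcription.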
Transcriber 1st pass: Initial markup
- Check audio file is OK
- Creation of skeleton transcription file with date, time and program name
- Initial segmentation (according to detailed rules below)
- Mark speaker changes
- Mark changes in background conditions
- Transcription of all speech segments (if transcription is not available)
- Spell check
Transcriber 2nd pass (recommended resolution: 30 seconds)
- Transcription of all speech segments
- Transcription of (frequent) noises
- Verification of uncertain orthography (especially names)
- Spell check
Transcriber 3rd pass, Validation (recommended resolution: 10 seconds or smaller)
The recordings were obtained via internet reception and satellite reception from Europe by Satellite (EbS). Satellite recordings were decoded and the audio streams resampled to WAV files. Internet recordings were provided as RealMedia streams. The recordings were not processed in any way, so they include Plenary Session pauses and segments of untranslated speech (language different from the target language). The “Day's schedule of EbS” was used to select the raw segments of a Plenary Session. The transcribed audio signals are WAV files with the following format: 16 kHz, PCM, 16 bits, single channel. RealMedia streams were converted to WAV files using WinAmp and RealPlayer software with the “Tara Audio Video Plugin for WinAmp”.
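The stated audio format (16 kHz, PCM, 16 bits, single channel) can be verified per file with Python's standard `wave` module. This is a minimal sketch, not a tool shipped with the database; the names `EXPECTED`, `wav_format` and `matches_expected` are illustrative.

```python
import wave

# Format stated in the description: 16 kHz, 16-bit PCM, single channel.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def wav_format(path):
    """Read channel count, sample width and sample rate from a WAV header."""
    with wave.open(str(path), "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": w.getframerate(),
        }


def matches_expected(path):
    """Return True if the file is a PCM WAV in the expected format."""
    try:
        return wav_format(path) == EXPECTED
    except wave.Error:
        # Raised for files the wave module cannot read, such as non-PCM data.
        return False
```

Running `matches_expected` over all .WAV files gives a quick way to spot files whose header differs from the stated format.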