We present here a versatile software platform created within the "Lexica and Corpora for Speech-to-Speech Translation Components" (LC-STAR, IST-2001-32216) project. The aim of the project is to create language resources (LR) for transferring Speech-to-Speech Translation (SST) components, thus improving human-to-human and man-machine communication in multilingual environments.
A demonstration platform was needed where the solutions for the different parts of the SST system could be integrated: Speech-To-Text (STT), Text-To-Text (TTT) and Text-To-Speech (TTS) engines. Furthermore, the platform was required to allow the recording (and translation) of speech to help in the creation of speech corpora. The solution had to fulfill several requirements, among them:
Possibility to work in a multilingual environment (initially 3 languages: Catalan, Spanish and English).
Facility to incorporate more languages without heavily modifying the core platform.
Allow different combinations/architectures of STT and TTT (for instance, STT and TTT working as separate modules, or TTT integrated into the STT).
Possibility to store both the recognized and translated text, and the original and synthesized speech waveforms.
The platform has been designed to operate in a distributed network environment, with a client/server architecture. Each server performing STT, TTT and TTS runs as a daemon, and the platform establishes connections with each of them as needed using sockets. The hierarchy is centralized: each module communicates directly with the platform, and is unaware of which other modules are part of the system. This makes it easier to change the functionality of the platform from, e.g., recording a dialogue between two users of the system, to testing the performance of a particular text-translation component by recording the recognized (or original) and translated text.
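The centralized hierarchy described above can be sketched as follows. This is an illustrative Python sketch, not Gaia's actual code; the host names and port numbers are assumptions, and the real engines are daemons reached over TCP sockets.

```python
import socket

# Hypothetical addresses of the engine daemons (assumptions, not real hosts).
ENGINES = {
    "stt": ("stt-host.example.org", 5001),
    "ttt": ("ttt-host.example.org", 5002),
    "tts": ("tts-host.example.org", 5003),
}

def connect_engines(engines):
    """Open one TCP connection per engine daemon. In the centralized
    hierarchy, each module talks only to the kernel; the modules never
    connect to each other."""
    connections = {}
    for name, (host, port) in engines.items():
        connections[name] = socket.create_connection((host, port), timeout=5)
    return connections
```

Because the kernel holds every connection, swapping the set of engines (e.g. dropping the TTT daemon to test recognition alone) only changes this dictionary, not the modules themselves.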
The user interfaces are also modules separated from the kernel, or core platform. Thus, the platform can be accessed either by telephone (the user calls the platform and configures the call using the keypad of the telephone), by microphone/speakers (the machine from which the user wants to start the platform connects through the network), or by text (useful if the user wants, for instance, to test a stand-alone text translator).
Each module of the platform (User, STT, TTT and TTS engines) runs as a daemon on a different machine, and communicates only with the kernel of the application. The kernel is the part of the platform that handles the communication between the different interfaces and the different events that can occur: end of turn detected, end of call (user hanging up), connection lost to a certain server, etc.
The communication between the different servers and the platform uses a specifically designed network protocol. The protocol contains commands to start and stop the servers, reset them, interrupt the communication, and configure particular aspects of the task.
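The actual wire format of the Gaia protocol is not described here; as a hypothetical illustration, the control commands listed above could be encoded as newline-terminated text lines like this:

```python
# Hypothetical line-based encoding of the control commands; the real
# Gaia protocol format may differ.
VALID_COMMANDS = {"START", "STOP", "RESET", "INTERRUPT", "CONFIG"}

def encode_command(command, *args):
    """Serialize a control command and its arguments into one
    newline-terminated protocol line."""
    if command not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {command}")
    return " ".join((command,) + args) + "\n"

def decode_command(line):
    """Parse one protocol line back into (command, argument list)."""
    parts = line.strip().split(" ")
    return parts[0], parts[1:]
```

A text-based framing like this keeps the daemons easy to debug by hand (e.g. with a telnet session), which matters when the servers run on different machines.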
Two different versions of the platform are provided to record user-to-user conversations. A first (and simpler) version allows the recording of "natural" conversations, understood as conversations where users can talk naturally, interrupting each other or talking simultaneously. The complete version of the platform is the one used to both record and translate the conversations. In this case, the users are "forced" to talk in turns, in a half-duplex way. The change of turn can be determined in a simple way by user interaction (pressing a key on the telephone keypad, pressing a button on the computer, etc.) or, in a more complex (and realistic) way, by the platform (incorporating an end-of-speech detection algorithm in the STT engine, setting a maximum time for each user, etc.). This high versatility is needed since the platform may be used to record corpora with different requirements.
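The half-duplex turn-taking logic can be sketched as a small state machine. This is an assumption about how such logic might look, not Gaia's actual implementation; the trigger conditions (user keypress, maximum turn time) are the ones named above.

```python
class TurnManager:
    """Illustrative half-duplex turn manager: exactly one user speaks
    at a time, and the turn changes on a user action or a time limit."""

    def __init__(self, users, max_turn_seconds=60):
        self.users = list(users)   # e.g. ["A", "B"]
        self.current = 0           # index of the user currently speaking
        self.max_turn_seconds = max_turn_seconds

    @property
    def speaker(self):
        return self.users[self.current]

    def should_switch(self, elapsed_seconds, key_pressed):
        """End of turn: explicit user interaction (key/button press)
        or the per-user time limit has been reached."""
        return key_pressed or elapsed_seconds >= self.max_turn_seconds

    def switch(self):
        """Invert the direction of the call: the listener becomes
        the speaker."""
        self.current = (self.current + 1) % len(self.users)
        return self.speaker
```

An end-of-speech detector in the STT engine would simply be another input feeding `should_switch` (a hypothetical extension point, alongside the keypress and timeout conditions).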
When used to translate human-human conversations, the user connects to the platform and sends their language and the phone number or IP address of the user they want to communicate with. The platform then connects to the second user and gets their configuration (language and phone, microphone or text interface). The platform then connects each interface to the appropriate server (multiple configurations are allowed here) and signals the user to start talking (or typing). The recorded audio goes through the platform to the STT server, and the recognized text can go directly to the TTS server or be first translated by the TTT server. Once the turn is over and the end user has listened to all the synthesized audio, the direction of the call is inverted in the platform and the user who was listening before is signaled to start talking.
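The routing of one turn through the engines can be summarized in a few lines. In this sketch the engines are stubbed out as plain callables (in the real platform they are network daemons); the function names are illustrative, not Gaia's API.

```python
def run_turn(audio, stt, tts, ttt=None):
    """Route one conversational turn: recognize the audio, optionally
    translate the recognized text, then synthesize it for the listener."""
    text = stt(audio)
    if ttt is not None:
        # The TTT step is optional: the recognized text can also go
        # directly to the TTS server (same-language playback),
        # or the TTT may be integrated into the STT engine itself.
        text = ttt(text)
    return tts(text)

# Toy usage with stand-in engines:
recognize = lambda audio: "hola"
translate = lambda text: "hello"
synthesize = lambda text: f"<audio:{text}>"
run_turn(b"...", recognize, synthesize, translate)  # -> "<audio:hello>"
```

Making the TTT step an optional parameter mirrors the architectural flexibility described above: separate STT and TTT modules, TTT integrated into the STT, or no translation at all.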
The platform accepts different configurations and combinations of servers for the different languages. All the configurations can be specified on the command line or in the initialization files, and each of them can be selected at any time (either globally, for all calls, or specifically, only for the incoming call). The platform can be monitored from any place with an Internet connection, since it has a web interface that allows selecting among the pre-configured combinations of servers. The web interface also allows monitoring any of the conversations the platform is currently handling: the user is presented with a list of the current conversations and can then visualize all the messages the different interfaces are generating (e.g. recognized text, translated text, amount of audio synthesized, changes of turn, etc.).
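As a hypothetical example of such an initialization file (the real Gaia file format is not documented here), a per-language server combination could be expressed in INI style and loaded like this:

```python
import configparser

# Hypothetical initialization file: one section per language,
# one host:port entry per engine. Hosts and ports are made up.
EXAMPLE_INI = """
[spanish]
stt = stt-es.example.org:5001
ttt = ttt-es-en.example.org:5002
tts = tts-es.example.org:5003
"""

def load_server_config(text):
    """Parse the (host, port) pair configured for each engine of
    each language."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    config = {}
    for language in parser.sections():
        config[language] = {
            engine: tuple(address.rsplit(":", 1))
            for engine, address in parser[language].items()
        }
    return config
```

Keeping one named section per server combination is what makes it cheap to switch configurations globally or per incoming call, as described above.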
Although the original purpose of the platform was the recording of conversations to obtain speech corpora that could later be used in speech-to-speech translation components, its versatility allows different uses. The text interface permits, for instance, the testing of text-translation components without needing any STT or TTS engines. It has also been used to test integrated speech-translation components with different architectures: separate STT and TTT engines, or with the TTT module integrated into the STT. The modular design of the platform and the many possibilities for connecting the different modules allow many configurations to be used.
One of the objectives of LC-STAR is to develop a demonstration platform to show the experimental results of the project by translating among three languages: Catalan, Spanish and US-English.
Hotel: Asking about hotel services
Hotel: Asking about leisure activities
Hotel: Demanding a service
Travel Agency: Booking flight tickets
Travel Agency: Booking hotel rooms
Travel Agency: Booking trip packages
The platform can be configured to be used either with one channel (one person speaks in the source language and the system provides the translation) or with two channels (two persons speak through the platform, and the platform performs the translation).
The platform is planned to be distributed, free of charge, for research purposes. It is very easy to plug in your own technology: speech recognition, spoken translation and speech synthesis. You can find more details in this Technical Report, or just contact us [javierp (AT) gps.tsc.upc.edu].
Gaia is Copyright (C) 2002-2006 Javier Pérez Mayos <javierp (AT) gps.tsc.upc.edu> and others
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2 of the License.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
This work has been partially supported by the Spanish government, under grant TIC-2002-04447-C02 (Aliado project), and by the European Union, under grant IST-2001-32216 (LC-STAR project).
Send your comments and suggestions to Javier Pérez