BIOMETRIC - Speaker & Language Characterization


    Technologies that automatically recognize speakers by their voices have attracted growing interest in recent years because of their many applications: access control, financial and commercial transactions, audio indexing of meetings and of radio and television programs, and police investigations. Research in this field covers two tasks: identifying or verifying a speaker's identity, and locating the boundaries between different speakers in a signal (speaker segmentation).

    Speech signals depend on the physical and emotional characteristics of speakers, such as the size of their vocal cords and vocal tract, their state of health, their mood and their linguistic habits. The environment in which speech is produced must also be taken into account, since environmental conditions may distort the signal. The systems that have obtained the best results to date use so-called low-level parameters, namely pitch, spectral magnitude and formant frequency. However, high-level features such as dialect, vocabulary, intonation and the duration of utterances are also known to help differentiate speakers.
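
    As a rough illustration of what low-level parameters look like in practice, the sketch below computes the spectral magnitude of one speech frame and a simple autocorrelation-based estimate of its fundamental frequency (pitch). The function name `frame_features`, the Hamming window and the 60-400 Hz search range are assumptions made for this example; they are not taken from the systems described here.

```python
import numpy as np


def frame_features(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return (magnitude_spectrum, pitch_hz) for a single speech frame.

    Illustrative only: the window and pitch search range are assumed values.
    """
    frame = np.asarray(frame, dtype=float)

    # Spectral magnitude: window the frame, then take the real-input FFT.
    windowed = frame * np.hamming(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))

    # Pitch: pick the autocorrelation peak inside the plausible lag range.
    centered = frame - frame.mean()
    acf = np.correlate(centered, centered, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(acf) - 1)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return magnitude, sample_rate / best_lag
```

    A 40 ms frame sampled at 16 kHz (640 samples) would be a typical input. Real systems usually go further and derive compact representations from such spectra, but the basic signal-level measurements are of this kind.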

    The TALP Center focuses on the following lines of research:

    • Speaker identification and verification using high and low level parameters.
    • Combinations of high and low level parameters.
    • System robustness to environmental conditions.
    • Speaker segmentation.
    • Multimodal recognition of speakers.
    • Combination of various features: voices, faces, irises, fingerprints, etc.


    In speaker recognition, recent work has used deep neural networks (DNNs) to model target speakers discriminatively, taking either i-vectors or feature vectors of target and impostor speakers as input. Two main contributions were proposed to make DNNs efficient: impostor selection and network adaptation. The resulting system performed very well in the international NIST SRE i-vector challenge. A speaker segmentation system based on a recurrent neural network (LSTM) that combines acoustic and language modelling was also developed, and this technology has been transferred to a real call center for automatic annotation. Over the next three years, the main objective is to improve the deep speaker recognition system by using more sophisticated deep learning techniques, by working with different levels of speech features or even directly with raw speech signals, and by proposing new impostor selection algorithms that take into account the structure of the background and training data.
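
    To make the idea of impostor selection more concrete, here is a minimal sketch of one plausible strategy: keep the background i-vectors that are most similar, by cosine similarity, to the target speaker's i-vectors, so that the network is trained against the hardest negative examples. This is an assumption made for illustration and not necessarily the selection algorithm used in the system described above; the function name `select_impostors` and its parameters are hypothetical.

```python
import numpy as np


def select_impostors(target_ivectors, background_ivectors, k):
    """Return indices of the k background i-vectors closest to the target.

    Hypothetical selection rule: rank by mean cosine similarity to the
    target speaker's i-vectors. Not the authors' published algorithm.
    """
    targets = np.asarray(target_ivectors, dtype=float)
    background = np.asarray(background_ivectors, dtype=float)

    # Length-normalise so that dot products equal cosine similarities.
    targets = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    background = background / np.linalg.norm(background, axis=1, keepdims=True)

    # Mean similarity of every background vector to all target vectors.
    scores = (background @ targets.T).mean(axis=1)

    # Highest-scoring background vectors come first.
    return np.argsort(-scores)[:k]
```

    The selected vectors would then serve as the negative class when training a discriminative model for that target speaker. Other criteria, for example ones based on the structure of the background data, fit the same interface.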
