Research based on data-driven statistical methods gained strong momentum in the first decade of this century, when a line of research on speech synthesis shifted to statistical machine learning techniques.
After decades of modest year-on-year gains in performance, the introduction of deep learning techniques is producing large steps towards human or even super-human performance in tasks such as text-to-speech.
Our recent work on DNNs has focused on multi-speaker speech synthesis, speaker adaptation and speaker interpolation with multi-output recurrent (RNN-LSTM) networks. More recently, we have produced more expressive speech by including, at the input, semantic features derived automatically from raw text, or by applying transfer learning. Our goal for the forthcoming years is to explore end-to-end architectures that can, on one side, derive the linguistic features automatically from raw text and, on the other, generate the speech waveform directly, without the quality loss that the parametric representation imposes.
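As a rough illustration of the speaker-interpolation idea mentioned above: multi-speaker synthesis networks are commonly conditioned on a per-speaker embedding, and interpolating between voices then reduces to blending two learned embeddings before conditioning the acoustic model. The sketch below shows only this blending step; the embedding size, variable names and random stand-in codes are illustrative assumptions, not the actual system.

```python
import numpy as np

# Hypothetical sketch: each speaker is represented by a learned code
# (embedding) that conditions the synthesis network. The codes below are
# random stand-ins for illustration only.
rng = np.random.default_rng(0)

EMB_DIM = 32                      # assumed embedding size (illustrative)
spk_a = rng.normal(size=EMB_DIM)  # learned code for speaker A (stand-in)
spk_b = rng.normal(size=EMB_DIM)  # learned code for speaker B (stand-in)

def interpolate_speakers(code_a, code_b, alpha):
    """Linear blend of two speaker codes: alpha=0 -> A, alpha=1 -> B."""
    return (1.0 - alpha) * code_a + alpha * code_b

# A voice halfway between speakers A and B would be conditioned on:
mixed = interpolate_speakers(spk_a, spk_b, 0.5)
print(mixed.shape)  # (32,)
```

In a full system this blended code would be fed, frame by frame, as an auxiliary input to the RNN-LSTM acoustic model alongside the linguistic features.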
A seminal work in speech enhancement uses generative adversarial DNNs. This work will be continued and extended, aiming not only at reducing noise but also at other distortions of the signal, such as non-linear distortion, chopped speech, etc.
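To make the adversarial setup concrete: a generator maps the noisy waveform to an enhanced one, while a discriminator learns to distinguish clean signals from enhanced ones, and the generator is additionally pulled towards the clean reference by a reconstruction term. The sketch below shows one common form of these objectives (least-squares adversarial loss plus an L1 term) on toy score vectors; it is an illustrative assumption about the loss design, not the actual system's implementation.

```python
import numpy as np

# Illustrative sketch of adversarial speech enhancement objectives.
# "Scores" stand in for discriminator outputs; all names are placeholders.

def d_loss(score_real, score_fake):
    """Discriminator: push real (clean) scores to 1, enhanced scores to 0."""
    return 0.5 * np.mean((score_real - 1.0) ** 2) + 0.5 * np.mean(score_fake ** 2)

def g_loss(score_fake, enhanced, clean, l1_weight=100.0):
    """Generator: fool the discriminator (scores -> 1) plus an L1 term
    pulling the enhanced waveform towards the clean reference."""
    adv = 0.5 * np.mean((score_fake - 1.0) ** 2)
    recon = np.mean(np.abs(enhanced - clean))
    return adv + l1_weight * recon

# A perfectly separating discriminator and a perfect generator both
# drive their respective losses to zero:
print(d_loss(np.ones(4), np.zeros(4)))            # 0.0
print(g_loss(np.ones(4), np.zeros(8), np.zeros(8)))  # 0.0
```

Extending the approach beyond additive noise, as proposed above, mainly changes the training data (pairs degraded by non-linear distortion, dropped segments, etc.), not this objective.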