Author: Dr. Marta R. Costa-jussà, a Ramón y Cajal Researcher, TALP Research Center, Universitat Politècnica de Catalunya, Barcelona
This week, we have a guest post from Marta R. Costa-jussà, a Ramón y Cajal Researcher at the TALP Research Center of the Universitat Politècnica de Catalunya in Barcelona. In Issue #37 we saw that, for zero-shot translation to work well, we must be able to encode the source text into a language-independent representation and to decode from this common representation into the target language. In this week’s issue, Marta gives us more insight into this topic and explains how to build such a system incrementally.
Introduction
Multilingual Neural Machine Translation is standard practice nowadays. A typical architecture consists of one universal encoder and decoder that are fed with multiple languages during training, which allows for zero-shot translation at inference time. The decoder is told which language to translate into simply by reading a tag added to the source sentence that carries this information. An alternative to this architecture is to use a separate encoder and decoder for each language and to share an attention layer, which becomes the interlingua component. In both cases, all components are trained at the same time, and adding a new language implies retraining the entire system.
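To make the tag trick of the universal encoder-decoder approach concrete, here is a minimal sketch; the token format is illustrative, since different toolkits use different conventions:

```python
# Minimal sketch of the target-language tag trick (illustrative token format):
# a single shared model learns which language to translate into from a tag
# prepended to the source sentence.
def add_target_tag(source_tokens, target_lang):
    """Prepend a target-language tag so the universal decoder knows the target."""
    return [f"<2{target_lang}>"] + source_tokens

print(add_target_tag(["How", "are", "you", "?"], "es"))
# ['<2es>', 'How', 'are', 'you', '?']
```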
Joint training and Incremental Language Addition
We propose an architecture that allows new languages to be added incrementally, without retraining the languages already in the system. To this end, we use independent encoders and decoders, with one encoder and one decoder per language. These encoders and decoders share the same intermediate representation.
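To make the module layout concrete, here is a minimal PyTorch sketch, assuming Transformer encoders and decoders; the layer sizes are illustrative, and embeddings and output projections are omitted:

```python
import torch.nn as nn

# Sketch: one independent encoder and decoder per language, all reading and
# writing the same d_model-dimensional intermediate space.
d_model = 512
languages = ["X", "Y"]

encoders = nn.ModuleDict({
    lang: nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=8), num_layers=6)
    for lang in languages
})
decoders = nn.ModuleDict({
    lang: nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model=d_model, nhead=8), num_layers=6)
    for lang in languages
})
```

Since every module maps to or from the same intermediate space, any encoder can in principle be paired with any decoder.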
Let’s assume we initially train our system with two languages, X and Y. To train a multilingual system with these two languages, we combine the tasks of auto-encoding in both languages (XX and YY) and translation from X to Y (XY) and from Y to X (YX). This is done by optimising the auto-encoder losses of both languages together with the two translation losses. In addition, we compute yet another loss, which minimises the distance between the intermediate representations produced by encoder X and encoder Y. We refer to this loss term as the interlingua loss.
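As a rough illustration, here is how the combined objective might look in PyTorch; the tensor shapes, the L2 distance and the weighting factor are assumptions made for the sketch, not necessarily the exact choices in the paper:

```python
import torch.nn.functional as F

def seq_ce(logits, targets):
    # Cross-entropy over token sequences: logits (batch, seq, vocab) are
    # transposed to (batch, vocab, seq), as expected by F.cross_entropy.
    return F.cross_entropy(logits.transpose(1, 2), targets)

def joint_loss(logits_xx, logits_yy, logits_xy, logits_yx,
               tgt_x, tgt_y, h_x, h_y, weight=1.0):
    """Sketch of the joint objective: two auto-encoding losses, two translation
    losses, and an interlingua loss pulling together the intermediate
    representations h_x and h_y of parallel sentences."""
    loss_autoencode = seq_ce(logits_xx, tgt_x) + seq_ce(logits_yy, tgt_y)
    loss_translate = seq_ce(logits_xy, tgt_y) + seq_ce(logits_yx, tgt_x)
    # Interlingua loss: distance between encoder X and encoder Y outputs for
    # parallel sentences (plain L2 here; the exact metric is an assumption).
    loss_interlingua = F.mse_loss(h_x, h_y)
    return loss_autoencode + loss_translate + weight * loss_interlingua
```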
Given the jointly trained model, the next step is to train a language Z without retraining any of the languages already in the system. Assuming parallel data between Z and either X or Y (say Z-X, for illustration), we train a new bilingual system. We reuse the previously trained decoder X and train our encoder Z. Note that decoder X is frozen and we only train the new module, encoder Z. By doing this we force encoder Z to produce representations similar to those of the already trained languages. As a consequence, our system is now able to translate from language Z to X and, in addition, to perform zero-shot translation between Z and Y, because our architecture builds on compatible encoders and decoders.
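Continuing the sketch above, the incremental step can be expressed by freezing the existing decoder and handing the optimiser only the new encoder’s parameters; the optimiser and learning rate are illustrative:

```python
import torch
import torch.nn as nn

d_model = 512
# Recreated here so the snippet runs on its own; in practice decoder_x is the
# already trained decoder for language X and keeps its learned weights.
decoder_x = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8), num_layers=6)
encoder_z = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8), num_layers=6)

# Freeze everything already in the system: only encoder Z receives gradients.
for p in decoder_x.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(encoder_z.parameters(), lr=1e-4)
```

Because decoder X never moves, encoder Z is pushed towards the intermediate space that the existing languages already share, which is what later makes zero-shot translation from Z to Y possible.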
Preliminary results
We have analysed how the model behaves for different low-resource languages (English, Turkish and Kazakh). Our model outperforms current bilingual systems by 5% and surpasses pivoting approaches by 14%. These results confirm that we are able to train independent encoders and decoders that share intermediate representations.
However, visualising the intermediate representations for different languages shows that similar sentences still tend to be placed at different points in the intermediate space. This contrasts with the good translation results above, which indicated compatible encoders and decoders. Contrary to our expectations, this suggests that the system may not require a common representation in order to learn compatible modules.
In summary
We have shown first steps towards achieving competitive translations with a flexible architecture for multilingual and zero-shot translation, one that scales to new languages without retraining the languages already in the system.
One of the next steps will be to further investigate learning compatible representations versus forcing exactly the same representation. Our focus is on benefiting from the advantages of the interlingua approach to translation (i.e. reducing the quadratic dependency on the number of languages to a linear one, and enabling incremental training), which may not require creating such a universal representation.
Acknowledgements
This work was done in cooperation with Carlos Escolano and José A. R. Fonollosa, and the technical paper can be found here. It is supported in part by a Google Faculty Research Award and by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the post-doctoral senior grant Ramón y Cajal.