Neural Machine Translation with Minimal Parallel Resources

Motivation and Approach

Machine translation has been revolutionized by the introduction of the neural approach. However, this approach requires large amounts of parallel data, and its advantage over traditional statistical methods diminishes when such data is not available, as is the case for many low-resource language pairs. The default strategy for translating between pairs of languages that lack significant mutual parallel corpora is to pivot through a third language (usually English), but sizeable third-language parallel corpora do not exist for all language pairs, and when they do exist, they are often out of domain.

Our aim in this workshop is to improve the performance of neural machine translation (NMT) in settings where parallel resources are limited, for instance, to a small parallel corpus and/or a bilingual lexicon. To do so, we will pursue two strategies for exploiting monolingual resources in source and target languages: developing techniques to induce translation relations from information latent in monolingual corpora; and enhancing NMT to exploit rich explicit linguistic annotations. These strategies are complementary, and have the practical advantage that they can be investigated in parallel, with few mutual dependencies. Both have the potential to improve the performance of NMT even when large parallel corpora are available, a setting that we will also evaluate. The remainder of our proposal describes each of these approaches in more detail.

Exploiting Unannotated Monolingual Corpora

Humans who wish to learn a foreign language do not do so by studying large parallel corpora. The best way for people to learn a language is by using it in the world, but it is also possible for us to learn from purely textual cues, beginning with some seed equivalences and reading in the foreign language to expand our knowledge.

In this part of the project, we aim to determine what minimal set of parallel resources is required to enable reasonable-quality neural MT, given the availability of monolingual corpora in two languages. Inspired by the human example, we propose to begin by building a strong model of the target language, then adding source-language information to that model, informed by bilingual constraints. We will investigate the most economical constraints, for instance the number and nature of lexical or sentential translation pairs, and also examine what other factors strongly affect outcomes, such as domain match, language similarity, and the scale and quality of monolingual resources.

The previous work most relevant to our proposal is dual learning for machine translation [1]. Given monolingual corpora in languages A and B, this approach trains NMT models A→B and B→A by round-trip translation from a monolingual A corpus, using a B language model score as the objective for A→B and reconstruction error as the objective for B→A (and symmetrically for the monolingual B corpus). Although in principle no parallel data is required for this method, the authors experiment only with settings in which a large amount is available, and use this data to initialize the NMT models. Doing so with 1.2M sentence pairs matches the performance of a pure NMT system trained on 10x more parallel data.
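To make the round-trip objective concrete, the sketch below shows how the two reward signals described in [1] might be combined for a single source sentence. The scoring callables are placeholders for the target-side language model and the reverse translation model, and the interpolation weight alpha is illustrative rather than a value taken from the paper.

# A minimal sketch of dual-learning reward computation, assuming black-box
# scoring functions; names such as lm_logprob_b are illustrative.
from typing import Callable, List, Tuple

def dual_learning_rewards(
    sentence_a: str,
    samples_b: List[str],
    lm_logprob_b: Callable[[str], float],        # log P_LM(b): fluency of the candidate translation
    back_logprob: Callable[[str, str], float],   # log P(a | b) under the B->A model
    alpha: float = 0.5,                          # interpolation weight (illustrative)
) -> List[Tuple[str, float]]:
    """Combine the language-model reward and the reconstruction reward for each
    sampled translation b of sentence_a, as in the dual-learning objective."""
    rewards = []
    for b in samples_b:
        r_lm = lm_logprob_b(b)                   # communication reward
        r_rec = back_logprob(sentence_a, b)      # reconstruction reward
        rewards.append((b, alpha * r_lm + (1.0 - alpha) * r_rec))
    return rewards

These rewards would then drive policy-gradient updates of both translation models; we omit that machinery here.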

A possible starting point for improving round-trip strategies for minimal-parallel-resource translation is to reinterpret them through the lens of decipherment [2]. Knowing language E, we wish to maximize our knowledge of language F, represented by the probability assigned to a corpus in F. In decipherment, one typically models each sentence f in F using p(f) = sum_e p(f|e) p(e); in our case p(f|e) will be the NMT model we wish to train. To make the sum over e tractable, we can borrow from round-trip approaches and sample e from another NMT model p(e|f). For each sampled sentence e’, we update the sampling model so as to minimize the distance between the model probability p(e’|f) and p(f|e’) p(e’) / p(f). By starting from decipherment, we gain access to a number of new terms that can provide additional learning signal: if we remove the p(e) factor in the first objective and the p(f|e’) / p(f) factor in the second, this reduces to the approach in [1]. A symmetrical step can also be carried out for each sentence in E.
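The following is a minimal sketch of the update targets described above, under the assumption that the normalizer p(f) is approximated by self-normalizing over the sampled set; the scoring callables stand in for the neural translation model and the source-side language model.

# A sketch of the decipherment-style targets, in log space; logsumexp over the
# samples is used as a stand-in for log p(f) (an assumption made for tractability).
import math
from typing import Callable, List

def posterior_targets(
    f: str,
    samples_e: List[str],
    bwd_logprob: Callable[[str, str], float],   # log p(f | e), the translation model we wish to train
    lm_logprob_e: Callable[[str], float],       # log p(e), a source-side language model
) -> List[float]:
    """Return target log-probabilities proportional to p(f|e) p(e), normalized
    over the sample set; the sampling model p(e|f) is then trained to move its
    probabilities p(e'|f) toward these targets."""
    scores = [bwd_logprob(f, e) + lm_logprob_e(e) for e in samples_e]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # approximates log p(f)
    return [s - log_z for s in scores]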

To incorporate parallel resources into this procedure, we can use them to initialize the coupled models p(e|f) and p(f|e). For small parallel corpora, this step is straightforward. For lexical resources like bilingual dictionaries, we could begin with autoencoder models p(f|f) and p(e|e), then use lexical equivalents to establish the mapping for source-side embeddings, possibly augmented with frequency or morphological cues for out-of-dictionary words.
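One concrete way to establish such an embedding mapping from a bilingual dictionary, sketched below, is to learn an orthogonal linear map between pretrained monolingual embedding spaces using the Procrustes solution; this assumes pretrained embeddings are available on both sides, and the variable names are illustrative.

# A sketch of mapping source-side embeddings onto the target space using
# dictionary pairs (f, e): find the orthogonal W minimizing ||X W^T - Y||.
import numpy as np

def learn_orthogonal_mapping(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """src_vecs[i] and tgt_vecs[i] are the embeddings of the i-th dictionary pair.
    Returns W such that src_vecs @ W.T approximates tgt_vecs, with W orthogonal."""
    u, _, vt = np.linalg.svd(tgt_vecs.T @ src_vecs)  # orthogonal Procrustes solution
    return u @ vt

All source embeddings, including those of out-of-dictionary words, could then be projected through the learned W, with frequency or morphological cues used to augment or filter the dictionary as described above.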

Exploiting Rich Syntactic/Semantic Annotations

The basic neural machine translation (NMT) approach is sequence-to-sequence [3]: a recurrent neural network encodes the source sentence, and a decoder generates the target sentence word by word. One of the key components of NMT is the attention mechanism [4], whereby the decoder focuses on those parts of the encoded input that are most relevant to generating the next word, playing a role similar to that of word alignment in the traditional statistical MT approach.
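For concreteness, the following is a small numpy sketch of one attention step in the style of [4]: each encoder state is scored against the current decoder state, the scores are normalized with a softmax, and the context vector is the resulting weighted sum. Parameter shapes are illustrative, and the surrounding encoder and decoder are omitted.

# A minimal sketch of additive attention; W_enc, W_dec: (h, d), v: (h,).
import numpy as np

def additive_attention(dec_state, enc_states, W_dec, W_enc, v):
    """dec_state: (d,); enc_states: (n, d). Returns (context, weights)."""
    scores = np.tanh(enc_states @ W_enc.T + dec_state @ W_dec.T) @ v  # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # attention distribution over the source
    context = weights @ enc_states                # convex combination of encoder states
    return context, weights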

Although the typical NMT approach uses sequence-to-sequence transduction, many problems require other forms of transduction. The core of this part of the project is to investigate methods for formalizing attention and decoding over structures such as trees, graphs, lattices, and forests. Once developed, these can then be used in the encoder-decoder architecture to generate the output while attending to various parts of the structured input. Potential applications are tree-to-string translation to exploit the syntactic structure of the input sentence, forest-to-string translation to capture ambiguity in the syntactic structure of the input, lattice-to-string translation to capture ambiguity in the input when translating from speech transcripts, and graph-to-string translation to capture the semantic structure of the input [5].

We will specifically investigate the use of deep semantic representations of the source text in machine translation. The meaning of a source sentence is represented with a semantic graph, which encodes the relations holding between the words (roughly, who is doing what to whom). This semantic graph is then used to generate a sentence in the target language, giving rise to more appropriate word choices and word ordering and, critically, preserving the important semantic relations of the original source sentence. This part of the project aims to combine the discrete semantic graph formalism with neural translation models. It has been shown that continuous space representations capture semantic and syntactic relations to some extent, and our goal is to enrich the continuous representation with explicit semantic information.

The 2017 Jelinek Workshop will be a perfect opportunity to explore attention mechanisms for the aforementioned combinatorial structures and their use in leveraging deep semantics in machine translation. Our proposed research consists of the following directions:

Encoding trees, forests, and graphs. A first attempt to encode trees will be based on a bottom-up approach that computes the representation of each phrase structure from the representations of its children (a minimal sketch of this composition appears after this list). Encoding forests is more challenging, however, because a forest node may have multiple derivations; the same holds for encoding graphs with cycles. An initial attempt will be based on converting such loopy graphs to directed acyclic graphs (DAGs).
Structured attention over trees and forests. Once an encoding of the input is constructed, we need to define a suitable attention mechanism. Intuitively, the attention should respect the combinatorial structure of the input.
Investigating the use of synchronous and quasi-synchronous grammars for input encoding and decoder attention.
Transducing the source sentence to its semantic graph representation. Existing semantic parsers can be used, or new ones can be developed. An initial attempt will be to develop a neural transducer that maps sentences to their semantic graph representations, based on [6]. The semantic graph is further encoded to create a distributed representation of its nodes.
Transducing the semantic graph to the target sentence. This involves developing a neural decoder that generates the target translation by attending to the semantically augmented source sentence.
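As referenced in the first direction above, the following is a minimal sketch of bottom-up tree encoding. The composition function is a deliberately simple single-layer network rather than any specific published architecture, and all parameters are assumed to be provided.

# A sketch of recursive, bottom-up tree encoding; embed maps tokens to vectors,
# and (W, b) are composition parameters of matching dimensions.
import numpy as np

def encode_tree(node, embed, W, b):
    """node is either a token string (leaf) or a list of child nodes.
    Returns a vector representing the subtree rooted at node."""
    if isinstance(node, str):
        return embed[node]                                   # leaf: word embedding
    child_vecs = [encode_tree(child, embed, W, b) for child in node]
    summed = np.sum(child_vecs, axis=0)                      # combine children (a simplification: order-insensitive)
    return np.tanh(W @ summed + b)                           # composed phrase representation

Extending this to forests would require aggregating over multiple derivations of a node, for example by summing or attending over the alternative compositions.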

In summary, we propose two thrusts to investigate the use of monolingual resources to improve neural machine translation, one leveraging large-scale monolingual corpora, and the other focused on linguistic annotations. Though these approaches are motivated by scenarios where parallel resources are scarce, they also have the potential to improve any NMT system; therefore, we will evaluate in both low- and high-resource settings.

References

[1] Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.-Y. Liu, W.-Y. Ma, “Dual Learning for Machine Translation”, NIPS, 2016.

[2] S. Ravi and K. Knight, “Deciphering Foreign Language”, ACL, 2011.

[3] I. Sutskever, O. Vinyals, Q. V. Le, “Sequence to Sequence Learning with Neural Networks”, NIPS, 2014.

[4] D. Bahdanau, K. Cho, Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR, 2015.

[5] B. Jones, J. Andreas, D. Bauer, K.-M. Hermann, and K. Knight, “Semantics-Based Machine Translation with Hyperedge Replacement Grammars”, COLING, 2012.

[6] J. Flanigan, S. Thomson, J. Carbonell, C. Dyer, N. Smith, “A Discriminative Graph-Based Parser for the Abstract Meaning Representation”, ACL, 2014.