Sequence to sequence learning with neural networks

In the NIPS 2014 paper "Sequence to sequence learning with neural networks", Ilya Sutskever et al. outline a machine translation approach based on long short-term memory (LSTM), a form of recurrent neural network. The network is trained and evaluated on the WMT'14 parallel corpus, translating from English to French.

The model consists of two components, both implemented as LSTMs: (i) an encoder that maps a source sentence to a fixed-size vector representation, and (ii) a decoder that turns that representation into a sentence in the target language. Decoding searches for the most probable target sentence, where the probability of each candidate is conditioned on the source representation computed by the encoder LSTM. The search is approximate and is performed using a left-to-right beam search.
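As a rough illustration of the left-to-right beam search, here is a minimal Python sketch. It is not the authors' implementation: the `beam_search` function, the `EOS` marker and the hand-written `TOY_MODEL` table are all hypothetical stand-ins. In the real system the decoder LSTM would supply the next-word probabilities, conditioned on the source representation.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

EOS = "<eos>"  # hypothetical end-of-sentence marker

Hypothesis = Tuple[List[str], float]  # (tokens so far, total log-probability)


def beam_search(
    next_log_probs: Callable[[Sequence[str]], Dict[str, float]],
    beam_size: int = 2,
    max_len: int = 10,
) -> List[Hypothesis]:
    """Return finished hypotheses, best (highest log-probability) first."""
    beam: List[Hypothesis] = [([], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every partial hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beam:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + [token], score + log_p))
        # Keep only the beam_size most probable extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))  # complete: leaves the beam
            else:
                beam.append((tokens, score))
        if not beam:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


# Toy next-token probabilities (made up), keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"le": 0.6, "un": 0.4},
    ("le",): {"chat": 0.5, "chien": 0.5},
    ("un",): {"chat": 0.9, "chien": 0.1},
    ("le", "chat"): {EOS: 1.0},
    ("le", "chien"): {EOS: 1.0},
    ("un", "chat"): {EOS: 1.0},
    ("un", "chien"): {EOS: 1.0},
}


def toy_next_log_probs(prefix: Sequence[str]) -> Dict[str, float]:
    """Stand-in for the decoder: log-probabilities of each next token."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[tuple(prefix)].items()}


best_tokens, best_score = beam_search(toy_next_log_probs, beam_size=2)[0]
greedy_tokens, greedy_score = beam_search(toy_next_log_probs, beam_size=1)[0]
```

In this toy example a beam of size 1 is greedy: it commits to "le" and ends with a sentence of probability 0.30. A beam of size 2 keeps "un" alive and finds "un chat" with probability 0.36, which shows why keeping several partial hypotheses can pay off.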

A curious finding is that reversing the order of the words in the source sentence (but not the target sentence) dramatically improves the results. The authors are open about not fully understanding why this helps. They theorise that reversal shortens the distance between the first words of the source sentence and the first words of the target sentence. It also lengthens the distance between the words at the end of the sentences, but this seemed to matter less, and the authors offer no explanation for why. The authors further speculate that the trick might improve results for regular recurrent neural networks as well. They admit they have no data to support this claim, but it might still be worth trying.
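The effect on distances can be made concrete with a small Python sketch. It rests on a strong simplifying assumption that is mine, not the paper's: source and target have the same length n and word i of the source corresponds to word i of the target. The function names `reverse_source` and `alignment_distances` are hypothetical.

```python
from typing import List, Sequence


def reverse_source(source_tokens: Sequence[str]) -> List[str]:
    """Reverse the source sentence only; the target is left untouched."""
    return list(reversed(source_tokens))


def alignment_distances(n: int, reversed_source: bool) -> List[int]:
    """Time steps between source word i and target word i.

    Assumes (hypothetically) a monotone one-to-one alignment and that the
    network reads the n source words first, then emits the n target words.
    """
    distances = []
    for i in range(n):
        src_pos = (n - 1 - i) if reversed_source else i
        tgt_pos = n + i
        distances.append(tgt_pos - src_pos)
    return distances


plain = alignment_distances(4, reversed_source=False)
flipped = alignment_distances(4, reversed_source=True)
```

For n = 4 the unreversed distances are 4, 4, 4, 4 and the reversed ones are 1, 3, 5, 7. The average is the same in both cases, but the first words end up much closer together and the last words further apart.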

To evaluate the model, two experiments were conducted:
In the first experiment, translations were generated directly by the model and scored using BLEU. In this setting the model performed well but did not reach state-of-the-art performance.
In the second experiment, the model was used as a reranker, rescoring translations generated by a traditional SMT system. In this setting the model did better, reaching a BLEU score of 36.5, which is close to the state-of-the-art score of 37.
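The reranking setup in the second experiment can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' code: the `rerank` function, the `weight` parameter and all scores below are made up, and combining the two scores with a weighted average (an even average when `weight` is 0.5) is just one simple choice.

```python
from typing import Callable, Dict, List, Tuple


def rerank(
    nbest: List[Tuple[str, float]],
    model_log_prob: Callable[[str], float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rescore an n-best list and return it sorted best first.

    nbest holds (hypothesis, SMT score) pairs; model_log_prob stands in for
    the log-probability the neural model assigns to a hypothesis.
    """
    rescored = [
        (hyp, (1.0 - weight) * smt_score + weight * model_log_prob(hyp))
        for hyp, smt_score in nbest
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Made-up n-best list from a hypothetical SMT system, best SMT score first.
NBEST = [("cat the sat", -1.5), ("the cat sat", -2.0)]

# Made-up log-probabilities standing in for the neural model.
TOY_MODEL_SCORES: Dict[str, float] = {"cat the sat": -4.0, "the cat sat": -1.0}


def toy_log_prob(hypothesis: str) -> float:
    return TOY_MODEL_SCORES[hypothesis]


reranked = rerank(NBEST, toy_log_prob)
```

Here the SMT system prefers "cat the sat", but after averaging in the toy model's scores "the cat sat" moves to the top of the list.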

The full paper can be found here.
However, the details of the LSTM itself are left out; the authors refer instead to Alex Graves's paper "Generating Sequences With Recurrent Neural Networks".