Training LSTM in MXNet.jl with differing input and output?



Sorry guys, I am new to this and don’t have good access to courses or good books on the subject. My question is probably best illustrated with a translator: the expected output would be quite different from the input (for example, English input and Nahuátl output). As far as I can tell, that means the training data should be two separate but correlated data sets. Unfortunately, it is not yet clear to me how I would set that up. Could anyone help me with this?

I have run the MXNet.jl example (and others) and played around with it a bit, but would like to keep advancing.

…By the way, if you have any recommendations of online sources that would help me further educate myself, I would be very appreciative as well.


The LSTM architecture in MXNet works well when you have a one-to-one alignment between input and output. For example, in a language model you’d take a single input token and try to predict the next token in the sequence.
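To make the one-to-one alignment concrete, here is a minimal sketch in plain Python (not MXNet.jl code; the function name `next_token_pairs` is made up for illustration). It only shows how the training pairs line up for a character-level language model: every input token has exactly one target, the token that follows it.

```python
def next_token_pairs(tokens):
    """Pair each token with the token that follows it."""
    inputs = tokens[:-1]   # everything except the last token
    targets = tokens[1:]   # the same sequence shifted left by one
    return list(zip(inputs, targets))


text = "hello"
pairs = next_token_pairs(list(text))
for current, following in pairs:
    print(f"{current!r} -> {following!r}")
```

Input and target always have the same length here, which is exactly the property a translation task does not have.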

When the input and output sequences differ in length (e.g. a 10-token English sentence and a 15-token Nahuátl translation), you need a different approach. The standard one is an encoder-decoder: one RNN (the encoder) reads the entire English sentence and produces a fixed-length vector (presumably representing the content of the sentence in some way). Another RNN (the decoder) then takes that vector as input and produces a sequence of output tokens followed by a stop token.
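As a rough illustration of that data flow, here is a toy sketch in plain Python. It is not MXNet code and it is not trained: the weights are random, a plain tanh RNN cell stands in for the LSTM, and every name and size in it (`HIDDEN`, `SRC_VOCAB`, `TGT_VOCAB`, `EOS`, `encode`, `decode`) is an assumption made up for the example. The point is only the shape of the computation: any number of input tokens goes in, one fixed-length vector comes out, and the decoder keeps emitting tokens until it produces the stop token or hits a length limit.

```python
import math
import random

random.seed(0)

HIDDEN = 8       # size of the fixed-length sentence vector (arbitrary choice)
SRC_VOCAB = 20   # hypothetical source vocabulary size
TGT_VOCAB = 25   # hypothetical target vocabulary size
EOS = 0          # hypothetical id of the stop token in the target vocabulary


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(weights, embedding, state):
    # Plain tanh RNN cell; an LSTM adds gates but is used the same way here.
    return [math.tanh(a + b) for a, b in zip(matvec(weights, state), embedding)]


# Untrained, random parameters: enough to show the data flow, nothing more.
enc_embed = rand_matrix(SRC_VOCAB, HIDDEN)
enc_w = rand_matrix(HIDDEN, HIDDEN)
dec_embed = rand_matrix(TGT_VOCAB, HIDDEN)
dec_w = rand_matrix(HIDDEN, HIDDEN)
dec_out = rand_matrix(TGT_VOCAB, HIDDEN)


def encode(src_ids):
    """Read the whole source sentence and return one fixed-length vector."""
    state = [0.0] * HIDDEN
    for token in src_ids:
        state = rnn_step(enc_w, enc_embed[token], state)
    return state


def decode(context, max_len=15):
    """Emit target tokens one at a time until EOS or max_len is reached."""
    state = context
    token = EOS  # assumption: reuse EOS as the start-of-sequence symbol
    output = []
    for _ in range(max_len):
        state = rnn_step(dec_w, dec_embed[token], state)
        scores = matvec(dec_out, state)
        token = max(range(TGT_VOCAB), key=scores.__getitem__)  # greedy pick
        if token == EOS:
            break
        output.append(token)
    return output


short_vec = encode([3, 7, 1])
long_vec = encode([3, 7, 1, 12, 5, 9, 2, 18, 4, 11])
translation = decode(long_vec)
print(len(short_vec), len(long_vec), len(translation))
```

Both encoded vectors have the same length regardless of how many tokens went in, and the output length is decided by the decoder rather than by the input. In a real system the two networks are trained jointly on sentence pairs; this sketch skips training entirely.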

I think this paper introduced the idea and there’s a TensorFlow example as well. I don’t think we have an example in Julia yet but I hope to put one together soon.


Thank you for your reply, MikeInnes. What I understood of that information seems logical. Now I have to be sure I understand the existing LSTM example well enough to modify and implement it.

So, at the risk of flaunting my ignorance…

It looks to me like in the LSTM example (in train.jl, the data on Line 19) text_tr is fed to the ANN and the expected output is text_val (Line 20), which in the existing example is just the next character in the same text body. So by doing that, it is really training the network to predict the next character.

So what I need to do (as per the English/Nahuátl example) is make text_tr an array of padded Nahuátl text and text_val an array of padded English text (each padded section being the same length as, and the translation of, the corresponding Nahuátl text).
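For reference, the padding idea could look something like the sketch below. It is plain Python rather than Julia, it says nothing about how the MXNet.jl example's data provider expects its input, and the names (`PAD`, `EOS`, `pad_to`, `build_pairs`) and the placeholder tokens (`e1`, `n1`, ...) are all invented for illustration. Each sentence gets an end marker and is then padded, and the source and target sides are padded to their own lengths, so they do not have to match each other.

```python
PAD = "<pad>"  # hypothetical padding symbol
EOS = "<eos>"  # hypothetical end-of-sentence symbol


def pad_to(tokens, length):
    """Append EOS, then fill with PAD up to the requested length."""
    sequence = tokens + [EOS]
    if len(sequence) > length:
        raise ValueError("sequence is longer than the requested length")
    return sequence + [PAD] * (length - len(sequence))


def build_pairs(sources, targets):
    """Pad each side to its own maximum length and keep sentences aligned."""
    if len(sources) != len(targets):
        raise ValueError("every source sentence needs exactly one translation")
    src_len = max(len(s) for s in sources) + 1  # +1 leaves room for EOS
    tgt_len = max(len(t) for t in targets) + 1
    return [(pad_to(s, src_len), pad_to(t, tgt_len))
            for s, t in zip(sources, targets)]


# Placeholder tokens standing in for real words in each language.
english = [["e1", "e2", "e3"], ["e4", "e5"]]
nahuatl = [["n1", "n2", "n3", "n4", "n5"], ["n6"]]

pairs = build_pairs(english, nahuatl)
for source, target in pairs:
    print(source, "=>", target)
```

What matters is that row i of one array is the translation of row i of the other. Whether the two sides also need the same padded length depends on the model, not on the data.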

Do I have the right idea here? Or am I way off?

I greatly appreciate your help, thank you!

EDIT: it looks like a new bug has popped up (Issue 197) after updating the LSTM example, which prevents me from doing much on this: "FullyConnected only accepts SymbolicNode either as positional or keyword arguments, not both."