Flux seq2seq



I don’t know whether this is the right place to ask but I’m trying to code a seq2seq model in Flux and I’ve got a couple of questions.

The encoder creates the hidden state for the decoder; how can I pass this state to the GRU in the decoder? Is this possible without custom-built layers?

Secondly, how can I use this model on the GPU? It seems like |> gpu doesn’t work on line 9 because of the indexing?
Also, result is of undefined size, so this can’t be run on my GPU?


model = function(seq, voc_size, max_length)
    seq = onehotbatch(seq, dictionary_fr[:, 1])
    seq = emb_layer_fr * seq
    # split seq into its columns:
    seq = [seq[:, i] for i in 1:size(seq, 2)]
    x = GRU(300, 256).(seq)[end]
    x = Dense(256, 300)(x)
    result = Vector{Any}(undef, 0)
    input = onehot(1, 1:voc_size) # <BOS>
    for i in 1:max_length
        input = emb_layer_nl * input
        output = Chain(GRU(300, 300), Dense(300, voc_size))(input)
        append!(result, output)
        input = onehot(argmax(output), 1:voc_size)
        input.ix == 3 && break
    end
    return result
end


I suggest that you take a look at https://github.com/FluxML/model-zoo/blob/master/text/phonemes/1-model.jl for an example of how an encoder/decoder model can be implemented in Flux. With the code you’re showing here, it looks like you’re creating a new copy of every single layer every time you call this function (and even throwing away and recreating the decoder layers each trip through the loop), and I’m guessing that this probably is not what you have in mind for your model’s behavior.
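To illustrate the point about layer reuse (a minimal sketch with placeholder sizes and names, not the full model): define the layers once at top level, and pass the encoder’s final hidden state to the decoder by assigning to the decoder GRU’s `.state` field, which old Flux’s `Recur` wrapper exposes:

```julia
using Flux

voc_size = 10_000                 # placeholder vocabulary size

# Layers are created ONCE, so their weights persist across calls
# and can actually be trained.
encoder_rnn = GRU(300, 256)
decoder_rnn = GRU(300, 256)
out_layer   = Dense(256, voc_size)

function encode!(seq)
    Flux.reset!(encoder_rnn)
    foreach(encoder_rnn, seq)              # step through the source tokens
    decoder_rnn.state = encoder_rnn.state  # hand the state to the decoder GRU
end

# Each decoding step reuses the same layers instead of rebuilding them.
decode_step(x) = softmax(out_layer(decoder_rnn(x)))
```

This way `params(...)` can collect all trainable weights once, and nothing is thrown away between calls.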

Regarding the GPU question, I haven’t actually used this functionality myself, but the documentation for it is here. My understanding is that you’ll need to make sure to call Flux’s gpu function on both your model’s weights and its inputs to make sure that everything has the appropriate GPU types.
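Per those docs, something along these lines should work (a sketch, assuming a CUDA-capable setup; both the weights and the inputs get moved):

```julia
using Flux

model = Chain(Dense(300, 256, relu), Dense(256, 10)) |> gpu  # weights to the GPU
x = rand(Float32, 300) |> gpu                                # inputs to the GPU too
y = model(x)                                                 # runs on the GPU
```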


I’ve changed my code quite a bit, but now, my model doesn’t return correct translations.

When I overfit my model with just one sentence pair, it gives the correct output, but when I try to overfit it with 2 (or more) sentence pairs, it returns a mix of the two sentences. It looks like the model just picks the most prevalent word from one of the sentences.

This is the link to a nextjournal notebook with my code and data (I can’t get it running there though).

I would highly appreciate it if anyone could take a quick look and tell me what I’m doing wrong, or link to a Julia implementation of machine translation. Chances are I’m making an easy mistake, since I’m a complete beginner.



I’m not sure there’s a comprehensive Julia implementation of seq2seq for MT anywhere, although I’ve talked about working on one and I may be able to look at your code soon if I get some time.


I finally found the problem: my input didn’t suit the encoder network. I’m going to upload my code as soon as possible.


@merckxiaan, I spoke to the nextjournal team to figure out why I couldn’t access your notebook (I am interested in seq2seq in Julia). It turns out that you did not publish the journal so as a result no one else can see it. I’d be greatly obliged if you would!


Hello @Nakul_Tiruviluamala,
After I posted my last message, I abandoned this project since I’m a beginner and I was making mistake after mistake.
Since then, however, I’ve been trying to implement PyTorch’s tutorial on seq2seq machine translation, following it as closely as possible. Now I believe I’m stuck due to a bug in Flux which prevents me from concatenating a transposed array (https://github.com/FluxML/Flux.jl/issues/378). I’ve probably also made a lot of mistakes here and there. It would be great if you could have a look at the code and let me know your thoughts/questions.
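Until that issue is fixed, one possible workaround (just a general Julia idiom, untested against Flux’s tracker) is to materialize the lazy transpose with `collect` before concatenating:

```julia
A = rand(2, 3)
v = rand(3)

# v' is a lazy transpose wrapper; collect turns it into a plain 1×3 Matrix,
# so vcat never has to dispatch on the wrapper type.
B = vcat(A, collect(v'))
size(B)  # (3, 3)
```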



You need to click publish, to get a shareable link! That’s just your internal edit link :wink:




Hi @merckxiaan, I’ll definitely be looking at it. I am a beginner as well!


Nice @Nakul_Tiruviluamala ,
I’ve just uploaded my code to a GitHub gist because I’m running into an error with Julia. Issue
Even though my code crashes, the loss does decline for a few steps…
Perhaps you could try to reproduce this error?



Never mind, I’ve started over once more. Does anyone spot something wrong with my encoder, decoder or attention layer? For some reason, when I train my model the loss gets stuck and the model predicts some frequent words.

struct Encoder
    embedding
    rnn
end
Encoder(voc_size::Int, h_size::Int) = Encoder(
    param(Flux.glorot_uniform(h_size, voc_size)),
    GRU(h_size, h_size))
function (e::Encoder)(x; dropout=0)
    x = e.embedding*x
    x = Dropout(dropout)(x)
    x = e.rnn(x)
end
Flux.@treelike Encoder

struct Decoder
    embedding
    rnn
    output
    attention
end
Decoder(h_size, voc_size) = Decoder(
    param(Flux.glorot_uniform(h_size, voc_size)),
    GRU(h_size*2, h_size),
    Dense(h_size, voc_size, relu),
    Attention(h_size))
function (d::Decoder)(x, encoder_outputs; dropout=0)
    x = d.embedding * x
    x = Dropout(dropout)(x)
    decoder_state = d.rnn.state
    context = d.attention(encoder_outputs, decoder_state)
    x = d.rnn([x; context])
    x = softmax(d.output(x))
end
Flux.@treelike Decoder

struct Attention
    linear
end
Attention(h_size::Int) = Attention(Dense(2*h_size, 1, tanh))
function (a::Attention)(encoder_outputs, decoder_state)
    weights = []
    for word in encoder_outputs
        weight = a.linear([word; decoder_state])
        push!(weights, weight)
    end
    weights = softmax(vcat(weights...))
    return sum([encoder_outputs[i].*weights[i, :]' for i in 1:size(weights, 1)])
end
Flux.@treelike Attention
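For anyone following along, a rough sketch of how these three pieces might be wired together for a single decoding step (hypothetical names: `voc_size_fr`, `voc_size_nl`, `source_seq` and `bos_onehot` are placeholders, and this assumes old Flux where a GRU’s hidden state lives in `rnn.state`):

```julia
# Hypothetical wiring of the three pieces above for one decoding step.
encoder = Encoder(voc_size_fr, 256)
decoder = Decoder(256, voc_size_nl)

Flux.reset!(encoder.rnn)
encoder_outputs = [encoder(x) for x in source_seq]  # one vector per source token

decoder.rnn.state = encoder.rnn.state               # hand over the final state
prediction = decoder(bos_onehot, encoder_outputs)   # first decoder output
```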


I’ve made some progress and put all my code, with some explanations, in a notebook. The model does seem to learn something… more often than not, the subject of the sentence is correct, but the remaining words are gibberish.

Also, I notice a big difference in performance with different hyperparameters, but I’m not sure how to choose the optimal ones.
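One simple, library-free approach to that is a small random search over a validation score (a generic sketch, not specific to this model; `evaluate` is a placeholder for training the model with the given settings and returning a validation loss):

```julia
# Random search over a hyperparameter grid. `evaluate` is a placeholder:
# it should train the model with the given settings and return a
# validation loss (lower is better).
function random_search(evaluate; trials=20)
    best_loss, best_params = Inf, nothing
    for _ in 1:trials
        params = (h_size  = rand([128, 256, 512]),
                  lr      = 10.0^rand(-4:-2),
                  dropout = rand([0.0, 0.1, 0.3]))
        loss = evaluate(params)
        if loss < best_loss
            best_loss, best_params = loss, params
        end
    end
    return best_params, best_loss
end

# Toy stand-in for a real training run, just to show the interface:
toy_eval(p) = abs(p.h_size - 256)/256 + abs(p.lr - 0.01) + p.dropout
best, loss = random_search(toy_eval; trials=50)
```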

I’d really appreciate someone providing me with some feedback.