Building a simple sequence-to-one RNN with Flux

Can anyone please help with my sequence-to-one RNN? Here is where I am:

Each of my sequences consists of seq_length vectors (zero-padded if necessary for shorter sequences), each of size n_features. If it makes sense for better performance, I want to pack several sequences into a single tensor (to transfer to the GPU with CUDA.jl, as described in the Flux docs). So I have 3 dimensions: (n_features, seq_length, n_samples).

With this data setup, I want to design a layer (very similar to Flux.Recur) that iterates through all time steps and outputs n_samples results. It all comes down to the following code:

mutable struct StepRNN{C, T}
    cell::C    # RNN cell
    state::T   # tensor of states, one per sample in the batch
end

function (r::StepRNN)(x)
    n_feature, seq_length, n_samples = size(x)   # extract tensor size
    reset!(r)   # reset the state of the RNNCell to zeros, since we start processing a new sequence
    r.state, y = r.cell(r.state, x[:, 1, :])     # first time step, for all n_samples simultaneously
    for i in 2:seq_length
        r.state, y = r.cell(r.state, x[:, i, :])
    end
    return y    # results for each of the n_samples
end

Does this design/code make sense, or can anyone please point me to a better implementation of sequence-to-one with Flux? Or how can I fix/improve this code?

Thank you very much in advance,

Why not use map or dot broadcasting with Flux’s built-in RNNs, as described in the Recurrence section of the Flux docs? This use case should already be handled by Recur.

I’m looking at this (correct me if this is not what you mean):

m = Chain(LSTM(10, 15), Dense(15, 5))
m.(seq)

But this is not what I want to achieve. As far as I can tell, it pushes the output of every time step to the Dense layer, whereas I want to push only the output of the last step. Ideally I would do this for several samples simultaneously (mini-batching), so that the Dense layer processes several samples at once.

Or can you clarify how dot broadcasting can help with this?

rnn = LSTM(10, 15)
fc = Dense(15, 5)

outputs = rnn.(seq) # or map(rnn, seq)
fc(outputs[end])

Flux’s RNNs already support minibatching; all you need to do is pass a matrix of size features x batch at each timestep rather than a feature vector. The dev docs (hopefully soon to be stable) explain this much better than I can.
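
As a rough sketch (plain Julia, no Flux needed; the sizes and the names x and seq are made up for illustration), a 3-D array laid out as (n_features, seq_length, n_samples) can be split into one features x batch matrix per timestep like this:

```julia
# Illustrative sizes, not taken from the original post
n_features, seq_length, n_samples = 4, 5, 3

# Batch tensor laid out as (n_features, seq_length, n_samples)
x = rand(Float32, n_features, seq_length, n_samples)

# One (n_features, n_samples) matrix per timestep
seq = [x[:, t, :] for t in 1:seq_length]

@show length(seq)     # number of timesteps
@show size(seq[1])    # (n_features, n_samples)
```

Each element of seq is then the kind of minibatched input that can be handed to the RNN one timestep at a time.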


Are you saying that we can implement it like:

model = Chain(
    x -> LSTM(a, b, ?)(x)[end], 
    Dense....)

Flux.train!(...(loss of model)...)

Where do we call reset! on the internal state of the LSTMCell? Is it in the (loss of model)?

Can you help more here?
(I would really like to avoid waiting for the docs to address this question.)

I think my point might’ve been unclear. You can go to that link right now and read a much better RNN tutorial than what is currently linked from fluxml.ai. The only reason it’s not live on the stable docs yet is because we need to cut a release :)

Anyhow, my question is why you need Chain at all. For example:

rnn = LSTM(10, 15)
fc = Dense(15, 5)

function loss(seq, ...)
  reset!(rnn)
  x = rnn.(seq)[end] # or use map, or just a loop
  return fc(x)
end

When using RNNs, Chain only makes sense if you want to pass the RNN output to the rest of the network at every timestep (e.g. sequence to sequence).

Thank you very much for helping!

I would prefer to have the reset in one place. To me it belongs to the model, not the loss, because the model can be called separately, and it would be strange to require a call to reset! before (or after) every call to model(new data).

But I ran into a bigger problem. Following your suggestion above, I wrote code in which each x is the array of data for time step i. This is pretty much what you have above, but it is not working. Here is why:

If you look at the code of RNNCell, you will notice that the h field is a vector, but it should be a matrix in my case, since there must be a separate hidden state for each input in the minibatch.

Specifically, h::V is initialized as zeros(out). But think about what happens if I pass an x that holds the i-th time step of all samples in the minibatch: the single hidden state h will be broadcast to all samples, which is not what I want.

Does this make sense? Or how would you do minibatching then?

I mean, this is already the case if you use model = Chain(...). There is no functionality in Flux for auto-calling reset!, so you will have to do it yourself at some point. That said, doing so is pretty straightforward:

rnn = LSTM(10, 15)
fc = Dense(15, 5)

function model(seq)
  reset!(rnn)
  x = rnn.(seq)[end] # or use map, or just a loop
  return fc(x)
end

function loss(seq, y, ...)
  y_hat = model(seq)
  return loss_func(y, y_hat)
end

Now you can use model without needing to reset manually or putting reset! into the loss function.

Have you actually tried calling the RNN with a minibatched input like you describe? It’s a little confusing and I think we could make it less so, but everything works as you’d expect:

julia> rnn = RNN(10, 3)
Recur(RNNCell(10, 3, tanh))

julia> rnn.state
3-element Vector{Float32}:
 0.0
 0.0
 0.0

julia> rnn.cell.h
3-element Vector{Float32}:
 0.0
 0.0
 0.0

julia> x = rand(Float32, 10, 8);

julia> rnn(x)
3×8 Matrix{Float32}:
 0.121863  0.0712726  0.468342   0.0159795  -0.50595    0.217166  0.321759  0.0969098
 0.78138   0.0184485  0.309471  -0.131435   -0.0146722  0.552875  0.227291  0.191328
 0.938252  0.981406   0.826487   0.98748     0.974808   0.960942  0.963614  0.964724

julia> rnn.state
3×8 Matrix{Float32}:
 0.121863  0.0712726  0.468342   0.0159795  -0.50595    0.217166  0.321759  0.0969098
 0.78138   0.0184485  0.309471  -0.131435   -0.0146722  0.552875  0.227291  0.191328
 0.938252  0.981406   0.826487   0.98748     0.974808   0.960942  0.963614  0.964724

julia> rnn.cell.h
3-element Vector{Float32}:
 0.0
 0.0
 0.0

julia> Flux.reset!(rnn)
3-element Vector{Float32}:
 0.0
 0.0
 0.0

julia> rnn.state
3-element Vector{Float32}:
 0.0
 0.0
 0.0

As you can see, the hidden state is actually stored in Recur and not the RNN cell. That hidden state does start off as a vector, but will be overwritten as a matrix with the right number of samples if you pass it a minibatched input.
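
To make the shape change concrete, here is a hand-written vanilla RNN step in plain Julia. It is only a sketch for illustration, not Flux’s source: the names Wi, Wh, b and step are made up, and the sizes simply mirror the REPL session above.

```julia
# Hand-written vanilla RNN step, for illustration only (not Flux's source)
n_in, n_hidden, batch = 10, 3, 8

Wi = randn(Float32, n_hidden, n_in)       # input-to-hidden weights
Wh = randn(Float32, n_hidden, n_hidden)   # hidden-to-hidden weights
b  = zeros(Float32, n_hidden)             # bias

step(h, x) = tanh.(Wi * x .+ Wh * h .+ b)

h0 = zeros(Float32, n_hidden)      # initial state: a plain vector
x1 = rand(Float32, n_in, batch)    # first timestep for 8 samples
x2 = rand(Float32, n_in, batch)    # second timestep for 8 samples

h1 = step(h0, x1)   # Wh * h0 is a 3-vector, broadcast across the 8 columns
h2 = step(h1, x2)   # h1 is 3x8, so each sample now uses its own column

@show size(h0) size(h1) size(h2)
```

The shared vector only matters on the first step, where every sample starts from the same zero state anyway. From then on the state is a matrix with one column per sample.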


Thank you very much, the solution you provided worked for RNN (on both CPU and GPU).

There are still issues with converting the code to use CUDA.jl and other types of cells (GRU and LSTM): when I move everything to the GPU with |> gpu, I get a runtime error (in Flux.train!()) related to rnn.(seq)[end]. I will need to investigate further to rule out a problem in my own code, for example in how I prepare the data.

What I want to achieve is a sentiment classifier similar to (or the same as) the one in the Keras example, which works on both CPU and GPU with comparable performance. This seems like a very basic example, but I have not been able to find anything close to it for Flux online yet.