Problem with LSTM and GRU Layers in Flux

I have built a time series forecasting model in Flux.jl that includes an LSTM layer (the problem is the same when I use a GRU layer instead). I can train the model without errors. However, when I run model(val_samples) or model(test_samples), the model throws an error instead of returning a vector of forecasted targets, while model(train_samples) works just fine.

#My model is the following: 
model = Chain(
    Flux.flatten,
    LSTM(3,32),
    Dense(32,32,relu),
    Dense(32,1)
    )
#The data I have is: 
julia> size(train_samples)
(3, 1, 7642)
julia> size(val_samples)
(3, 1, 955)
julia> size(test_samples)
(3, 1, 955)
#And the labels are: 
julia> size(train_targets)
(1, 7642)
....

#I train the model with: 
ps = Flux.params(model)
opt = Flux.RMSProp()
loss(x,y) = Flux.Losses.mae(model(x),y)

epochs = 300
loss_history = []
for epoch in 1:epochs 
    Flux.train!(loss, ps, [(train_samples, train_targets)], opt)
    train_loss = loss(train_samples, train_targets)
    push!(loss_history, train_loss) 
    println("Epoch = $epoch Training Loss = $train_loss")
end 
#I correctly get results for the training data: 
julia> model(train_samples)
1×7642 Matrix{Float32}:
 0.0961234  0.0967543  0.0972749  0.0951103  0.0937003  0.091914  0.0913435  …  38.6692  43.5427  43.0876  43.2159  43.4824  43.612  43.5726  43.5831

The error I get, after having successfully trained the model, is the following:

julia> model(val_samples)
ERROR: DimensionMismatch: array could not be broadcast to match destination
Stacktrace:
  [1] check_broadcast_shape       
    @ .\broadcast.jl:553 [inlined]
  [2] check_broadcast_shape       
    @ .\broadcast.jl:554 [inlined]
  [3] check_broadcast_axes        
    @ .\broadcast.jl:556 [inlined]
  [4] instantiate
    @ .\broadcast.jl:297 [inlined]
  [5] materialize!
    @ .\broadcast.jl:884 [inlined]
  [6] materialize!
    @ .\broadcast.jl:881 [inlined]
  [7] muladd(A::Matrix{Float32}, B::Matrix{Float32}, z::Matrix{Float32})
    @ LinearAlgebra C:\Users\User\AppData\Local\Programs\Julia-1.9.3\share\julia\stdlib\v1.9\LinearAlgebra\src\matmul.jl:249
  [8] (::Flux.LSTMCell{Matrix{Float32}, Matrix{Float32}, Vector{Float32}, Tuple{Matrix{Float32}, Matrix{Float32}}})(::Tuple{Matrix{Float32}, Matrix{Float32}}, x::Matrix{Float64})
    @ Flux C:\Users\User\.julia\packages\Flux\ljuc2\src\layers\recurrent.jl:314     
  [9] Recur
    @ C:\Users\User\.julia\packages\Flux\ljuc2\src\layers\recurrent.jl:134 [inlined]
 [10] macro expansion
    @ C:\Users\User\.julia\packages\Flux\ljuc2\src\layers\basic.jl:53 [inlined]     
 [11] _applychain(layers::Tuple{typeof(Flux.flatten), Flux.Recur{Flux.LSTMCell{Matrix{Float32}, Matrix{Float32}, Vector{Float32}, Tuple{Matrix{Float32}, Matrix{Float32}}}, Tuple{Matrix{Float32}, Matrix{Float32}}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}, x::Array{Float64, 3})
    @ Flux C:\Users\User\.julia\packages\Flux\ljuc2\src\layers\basic.jl:53
 [12] (::Chain{Tuple{typeof(Flux.flatten), Flux.Recur{Flux.LSTMCell{Matrix{Float32}, Matrix{Float32}, Vector{Float32}, Tuple{Matrix{Float32}, Matrix{Float32}}}, Tuple{Matrix{Float32}, Matrix{Float32}}}, Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}})(x::Array{Float64, 3})
    @ Flux C:\Users\User\.julia\packages\Flux\ljuc2\src\layers\basic.jl:51
 [13] top-level scope
    @ REPL[134]:1

Applying the identical data structures to a model without an RNN cell, using only Dense layers, does not lead to the problem: I can get forecasts for the training, validation, and test samples without issues.

model = Chain(
    Flux.flatten,
    Dense(3,32,relu),
    Dense(32,32,relu),
    Dense(32,1)
    )

I don’t understand where the dimension mismatch originates. My guess is that I have defined the RNN chain incorrectly, so that the dimensions of train_samples somehow matter for later calls, but I don’t understand how that could be.

Thank you in advance for any response to this.

Try calling Flux.reset! on the model after training to reset the hidden state.

See the paragraph “Batch size changes” in the documentation.
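
In other words, something along these lines (a minimal sketch of the suggestion):

Flux.reset!(model)   #the stored recurrent state still has the training batch size
model(val_samples)   #after reset! the initial state broadcasts to any batch size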

This now lets me run the model, but am I not dropping all the progress made with Flux.train!? How can I estimate how well my trained model performs on the validation data if I am discarding all the training previously done on the training set?

No, you are not discarding the training. Training adjusts the weight matrices, whereas reset! only affects the hidden state of the LSTM layer. You should do this between inferences anyway; otherwise you carry hidden state over from one inference to the next. Recurrence · Flux explains this.
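
To see this concretely, here is a small sketch (assuming the Flux 0.13/0.14-style layers shown in your stack trace): reset! puts the recurrent state back to its initial value but leaves the trained weight matrices alone.

lstm = model[2]                  #the Recur wrapping the LSTMCell
Wi_before = copy(lstm.cell.Wi)   #trained input-weight matrix
Flux.reset!(model)               #sets lstm.state back to lstm.cell.state0
lstm.cell.Wi == Wi_before        #true: the weights are untouched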

Thank you very much for the info. I just don’t understand how to fix the problem. I have now tried to simply do:

#Here the forecasts are, correctly, very close to the original labels
y_hat = model(train_samples) 
#But here the forecasts are completely off, even though I can now run
#the model on the validation sample.
Flux.reset!(model)
y_hat = model(val_samples)

If, instead, I artificially make the batch sizes of all samples (train, val, test) equal to each other, the problem disappears and all forecasts make sense compared to the original labels.

I don’t get how to use Flux.reset! correctly. I didn’t quite find an answer to this in the linked documentation.

Actually, I have tried this and it seems to work, though I am not sure it is correct:

epochs = 500
loss_history = []

for epoch in 1:epochs 
    Flux.train!(loss, ps, [(train_samples, train_targets)], opt)
    train_loss = loss(train_samples, train_targets)
    Flux.reset!(model)  #Reset inside the loop, after training and computing the loss
    push!(loss_history, train_loss) 
    println("Epoch = $epoch Training Loss = $train_loss")
end 

y_hat = model(train_samples)
Flux.reset!(model) 
y_hat = model(val_samples)
Flux.reset!(model)
y_hat = model(test_samples)

Yes, this seems sensible. However, your data does not appear to consist of sequences at all, so I’m not sure what a recurrent network is going to achieve here: with sequence length one, the LSTM effectively becomes a dense layer. To be precise, in case you aren’t aware, the last dimension of your data is the batch dimension, because the input is flattened into a 2-d array inside the model.
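
For illustration, this is what the model actually sees after Flux.flatten, i.e. a (features, batch) matrix in which every column is an independent sample at a single time step (sizes taken from the data you posted):

julia> size(Flux.flatten(train_samples))
(3, 7642)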

You can call a Recur unit on 3-d arrays; then the dimensions are (features, batch, time).
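
A minimal sketch of that (Flux 0.13/0.14-style layers; the sizes here are purely illustrative, not your data):

#1 feature, 4 sequences in the batch, 10 time steps
m = Chain(LSTM(1, 32), Dense(32, 1))
x = rand(Float32, 1, 4, 10)
Flux.reset!(m)
y = m(x)    #size (1, 4, 10): one output per sequence and per time step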

The data I have is a simple univariate time series with 9555 observations.
I have transformed it with a function into the tensor structure (3, 1, 9552): for each time step I take the previous three observations as input (so the first three observations must be dropped) and the fourth observation as the target, which gives me a target vector with 9552 entries.
The goal is time series forecasting, so basically I want to feed the LSTM model a (3, 9552) tensor (flattened) and get as output the one-step-ahead forecast for each time step (with a window of 3). Of course, I also split the data into training, validation, and test samples.
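
For reference, a hypothetical sketch of the windowing transform described above (the function and variable names are illustrative, not my original code):

function make_windows(series::AbstractVector, w::Int = 3)
    n = length(series) - w                   #9555 observations -> 9552 windows
    samples = Array{Float32}(undef, w, 1, n)
    targets = Array{Float32}(undef, 1, n)
    for i in 1:n
        samples[:, 1, i] = series[i:i+w-1]   #last w observations as inputs
        targets[1, i] = series[i+w]          #next observation as target
    end
    return samples, targets
end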

With your suggestions, the model seems to work (even though it is not accurate at forecasting outside the training sample), though I think this is just a matter of how I sampled the data.

Furthermore, I don’t understand why the model breaks down when I reduce the sample size. I tried the same with Dense layers and the predictions are still quite accurate, whereas the LSTM model seems to work fine with many samples, but when I restrict training to, say, 500 samples, the predictions become constant and do not make sense even on the training data. This should not happen, because in the application I am trying to replicate, reducing the sample does not affect the results, so I think there must be some issue in how I feed the data to the model.

Thank you for all the replies.

As @skleinbo mentioned, this is not the correct format for a univariate time series. If you have 3 sequences and 1 variable/feature, the 3-D array you pass to an RNN should have shape 1x3x9552, i.e. (# of features) x (number of sequences) x (sequence length). Usually, if you have a large number of sequences, you’ll want to batch them too. In this case it looks like you only have 3, so you can ignore that part and don’t even need to use DataLoader.
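
As a hedged sketch of that layout (assuming the full (3, 1, 9552) array is to be reinterpreted as 1 feature × 3 sequences × 9552 steps; full_samples is an illustrative name, not a variable from the thread):

x3d = permutedims(full_samples, (2, 1, 3))   #(3, 1, 9552) -> (1, 3, 9552)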

Another vital thing that is probably affecting your results is that Flux.reset! must be called on a model containing RNN layers before/after every batch of data is fed to it. This applies during both training and inference; otherwise the model uses stale initial state left over from the previous batch of sequences. For this reason, I generally do not recommend using train! when working with RNNs, since it hides the loop over batches and gives you nowhere to call reset!. Instead, use a custom training loop, as outlined in the docs.
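
A minimal sketch of such a loop, using the same implicit-parameters API as your code above (the single-batch iterator here is just illustrative; a DataLoader would slot in the same way):

for epoch in 1:epochs
    for (x, y) in [(train_samples, train_targets)]   #or batches from a DataLoader
        Flux.reset!(model)                           #clear the state before each batch
        gs = Flux.gradient(() -> Flux.Losses.mae(model(x), y), ps)
        Flux.update!(opt, ps, gs)
    end
end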

Thank you very much.
I was confused because I started out doing deep learning with Keras. It seems to me that Flux allows for much more flexibility, at the cost of having to fully understand what goes on “behind the scenes”.