Simple Flux LSTM for Time Series

Hi, I have been trying for a long time to get Flux working for a basic time series.

I think if I can get a really simple example working, I would be able to go from there, but when I try to do the ‘obvious’ thing, it falls over.

In the example below x is a matrix of (34) lagged returns by 7300 rows and y is a vector of 7300 to be forecast via MSE.

Can anyone please help or direct me to any useful relevant documentation?

After loading Flux and the data, here is what I tried:

julia> N=length(x[1,:]) # 34

julia> m = Chain(
LSTM(N, 10),
Dense(10, 1))
Chain(Recur(LSTMCell(34, 10)), Dense(10, 1))

julia> function loss(xs, ys)
l = sum((m(xs)-ys).^2)
return l
loss (generic function with 1 method)

julia> opt = ADAM(0.01)
ADAM(0.01, (0.9, 0.999), IdDict{Any,Any}())

julia> evalcb = () -> @show loss(x, y)
#18 (generic function with 1 method)

julia> loss(x’,y’)
(34, 7354)
(1, 7354)

julia> Flux.train!(loss, params(m), (x’, y’), opt)
ERROR: MethodError: no method matching loss(::Adjoint{Float64,Array{Float64,2}})
Closest candidates are:
loss(::Any, ::Any) at none:2
[1] macro expansion at /Users/davide/.julia/packages/Zygote/ApBXe/src/compiler/interface2.jl:0 [inlined]
[2] _pullback(::Zygote.Context, ::typeof(loss), ::Adjoint{Float64,Array{Float64,2}}) at /Users/davide/.julia/packages/Zygote/ApBXe/src/compiler/interface2.jl:7
[3] #14 at /Users/davide/.julia/packages/Flux/CjjeX/src/optimise/train.jl:84 [inlined]
[4] _pullback(::Zygote.Context, ::getfield(Flux.Optimise, Symbol("##14#22")){typeof(loss),Adjoint{Float64,Array{Float64,2}}}) at /Users/davide/.julia/packages/Zygote/ApBXe/src/compiler/interface2.jl:0
[5] pullback(::Function, ::Zygote.Params) at /Users/davide/.julia/packages/Zygote/ApBXe/src/compiler/interface.jl:103
[6] gradient(::Function, ::Zygote.Params) at /Users/davide/.julia/packages/Zygote/ApBXe/src/compiler/interface.jl:44
[7] macro expansion at /Users/davide/.julia/packages/Flux/CjjeX/src/optimise/train.jl:83 [inlined]
[8] #train!#12(::getfield(Flux.Optimise, Symbol("##18#26")), ::typeof(Flux.Optimise.train!), ::typeof(loss), ::Zygote.Params, ::Tuple{Adjoint{Float64,Array{Float64,2}},Adjoint{Float64,Array{Float64,2}}}, ::ADAM) at /Users/davide/.julia/packages/Flux/CjjeX/src/optimise/train.jl:80
[9] train!(::Function, ::Zygote.Params, ::Tuple{Adjoint{Float64,Array{Float64,2}},Adjoint{Float64,Array{Float64,2}}}, ::ADAM) at /Users/davide/.julia/packages/Flux/CjjeX/src/optimise/train.jl:78
[10] top-level scope at none:0

It doesn’t seem to like evaluating ‘loss’ during training, even though ‘loss’ worked when I just tried evaluating it on the same data outside training. ???
Can anyone help please [apologies for my ignorance… I can’t seem to find any proper documentation (other than a few terse examples) anywhere!]



Looking at the first line of the error trace it seems like train is trying to call your loss function with only x'. I think the reason is that train will iterate over the input data. Try putting that tuple in an array, like this : [(x', y')]. This should make the iteration first pick out the tuple and call your loss function the splatted tuple.

1 Like

Btw, flux expects time series to be a sequence of ninputs x batchsize. If your x is a matrix where the columns are the time steps I think you need to call the loss function ‘column by colum’ and compute the loss using the last output (assuming you’re after predicting the next step). Sorry for lack of examples, writing this on the phone.

Relevant doc:
Pay attention to the dots :slight_smile:


Hi DrC, Thanks so much for your help. In my case, each row represents a day (y[i]) with its previous history x[i,:] ) so if I were to batch these, they would be in the form you say. Do the batches need to be in sequence within? [I had thought not. Or have I got it completely wrong?]
[PS, it seems to be working]

I guess I don’t understand (see below). It seems that evaluating the model
from example from inputs 1: end works but evaluating from 2:end throws an error:

julia> size(m(x[1:end,:]’))
(1, 7354)

julia> size(m(x[2:end,:]’))
ERROR: DimensionMismatch(“arrays could not be broadcast to a common size”)

Thanks in advance for any insights

I haven’t run your example, but from the looks of it, Flux will assume that in the first call you are feeding it the first timesteps of a batch of 7354 (different) examples of one feature.

This will cause the recurrent layer to store a hidden state of dimension Nx7354 (where I assume N is 34 from above).

In the second call, flux thinks you are feeding it the next timesteps for all those examples. However, now one example seems to be missing and the consequence is a dimension mismatch error as a Nx7354 array (the hidden state) is added to a Nx7353 array (the new timestep).

Before I attempt to explain how to fix this, just a control question:

From your explanation I get the impression that you have 34 values for each day and you want to learn the mapping from those 34 values to a single value (your ys) for each day.

The question assuming this is correct: Do you also belive that the data (x[i < ic,:]) from past days can be used to infer the value y[ic]? If not, you are probably better off just viewing each day as an independent observation and not use recurrent layers.

Ok, now assuming that what you have is one example of a 7354 long sequence of days where information from the past days matter, what you shall feed into your model is 7354 Nx1 timesteps in some kind of loop.

Here is one (untested) example:

xseq = Flux.unstack(x, 2) # Should translate x to an 7354 long array of 34x1 arrays
yseq = Flux.unstack(y,1)
for (xt, yt) in zip(xseq, yseq) # Loop over each time step
    Flux.train(loss, params(m), [(xt,yt)], opt)

There are obviously more concise ways to write the above which don’t create as many temporary variables, but this is the gist of it. For instance I think you can pass the zipped iterator directly to the train method without the loop outside.

Now, you might have a difficulty to learn your sought function if you only have a single example and I also think that 7000 timesteps is pushing it when it comes to what kind of time dependencies an LSTM can learn, but I guess this is another topic and there should be plenty of material on the web on stuff like this.


Hi DrC, Thank you so much for taking the time to help me with this. BTW, I have a long experience with state-space models including Kalman Filters. I think my main problem is that I have been confused by people who have ‘explained’ LSTM implementations to me (including online) who I can see now do not understand what is really going on themselves. [The algorithm you describe corresponds more closely to my impression prior to listening to them].

Cheers, glad I could help (assuming what I wrote actually worked out for you).

Anyways, most DL frameworks use 3D arrays as input to recurrent layers (and for example do the looping over the time dimension “inside” the layer). Maybe this is part of the source of your confusion? While this to some extent obscures what is going on I guess it has the advantage that the user does not need to run their model in a significantly different way when there are recurrent layers in it; one batch of examples is always one array.

Hi DrC, Thanks again. My exchange here with you has generated a (slightly heated?) debate between myself and a colleague from computer science, which I think boils down to this: Along one dimension of the input, are successive values just shifted versions, each from the previous?

Thanks again,



Your colleague is right here. There does not need to be a strict 1-1 relationship between input and output. It all depends on how you set up your model and training loop.

For example, I think it is pretty common to just use the last time step as output. Consider for example the toy use case of regressing a few sample moments (mean, var, skew etc) from a sequence (which is a pretty good “getting started” example btw).

I think this should just work out of the box in flux. You’d obviously would need to alter the training loop in the example I gave (first give the whole sequence of inputs and then compute the loss), but other than that it should work.

Thanks again for that. I think it is getting clearer.

I had just figured out my error, but my colleague was wrong, too, as he did not have a dimension of the input clearly associated with a timestep, which confused me.
I think I see the need for 3D arrays now, with potentially only one output.



Hi DrC Me again (sorry!)

I studied your suggestions above and ran something which seems to have worked(?) but I don’t understand the results.

First I made 7354 cases of sequences of 1-dimensional input sequences of length 30. I made the y’s into 7354 1-D arrays like your example, then put them as tuples into an array called “data_array”


for i=1:7354 # Loop over each time step

m = Chain(
LSTM(30, 10),
Dense(10, 1))

function loss(xs, ys)

l = sum((m(xs).-ys).^2)
return l


opt = ADAM(0.01)

Flux.train!(loss, params(m), data_array[1:6000], opt)

[Seems to have run OK, but…]

I would have expected that the forecast for the first input sequence would be a single number, but it is dimension 30 - the same as the length of the sequence.
Furthermore, it changes when I try to predict it again(!?).

julia> m(data_array[1][1])
1×7354 Array{Float32,2}:
0.0432908 0.0404577 0.0374243 0.0417421 0.0342939 0.0330891 … 0.0232059 0.0215446 0.0155102 0.0156021 0.0189834

julia> m(data_array[1][1])
1×7354 Array{Float32,2}:
0.0483132 0.0460149 0.0430189 0.047561 0.0400533 0.0391228 … 0.0303929 0.0281151 0.0235584 0.022974 0.0262266

Now I am really confused.

BTW, I thank you for your help so far. Maybe if I get this figured out,
I might post the solution to possible help others.

Thanks for any more help.



The way you run the model, it will use the features (x) from day 1:i to predict the label (y) of day i. As such, the model will output one predicted y for each input x, it is just that for the first is it has less history to use (i.e you can expect estimates to be worse).

If you are interested in predicting y[t+1] given x[1:t,:] for a specific value of t (this is essentially the “last time step” use case I mentioned above) I think you could reformulate the training so that you first feed the model all x[1:t-1,:] (just discard the output) and then calculate the loss of m(x[t,:]) and y[t+1]. The AD will take care of differentiating and updating parameters for all those previous calls through the hidden state. I haven’t used recurrent layers in a long time though so I might remember it wrong.

Yes, recurrent layers/models are impure functions because they have state. The model does not understand that you are feeding it the same timestep again (how should it know?!) so it will think it is the next step and adds the hidden state from the previous call. To signal that you intend to feed the model a new sequence I think you can call Flux.reset!(m) which should wipe all hidden state.

OK - thanks, that helps.

The beginning of my Chain here is LSTM(N,10).

In my example, I had intended only 1 predictor, but a window of 30.
Initially, I tried N=1 but it didn’t work, but it would only work with N=30.

Is N the product of the window length and the number of predictor variables, then?



I find myself with the same question now. I have multiple input features for one label.
So, if I have 3 features, and I want to use an input sequence of length 4, what would be a Flux LSTM architecture that can do this?
My attempt of Chain(LSTM(3,2), Dense(2,1))(rand(3,4)) returns a 1x4 Array. Basically, I’m trying to reproduce the “Multiple Input Series” from here If I understand correctly, that’s also what @compleat is trying to accomplish. @compleat did you figure this out in the meantime?


Actually, this post seems to suggest just ignoring the additional outputs and just taking the last one for making a “many-to-one” model:

The Flux recurrent models are explicit in their repeated application. To make your example work you would have to do:

using Flux

m = Chain(LSTM(3,2), Dense(2,1))

inputs = rand(3,4)

for t in 1:4
    output = m(inputs[:,t])
    @show output

This repeatedly applies the model m on new time steps of your data, updating the inner state at each application and producing output. A better way to organize your data would be the following:

inputs = [rand(3) for t in 1:4]
output = m.(inputs)

so you have your inputs as a vector of vectors and can use broadcasting for the repeated application of the model. Now it also becomes more clear how to apply the model in batches, because the second matrix dimension of your inputs is “free”.

So suppose you have data with 3 features, 10 samples and 4 times steps for each, then you would apply your model the following way:

data = rand(3,10,4)
inputs = [data[:,:,t] for t in 1:4]
output = m.(inputs)

Note that reset! here resets the inner state of the model and makes it ready to be applied for a different batch size, because the model state is saved individually for each batch member. Hopes this helps…


It does help, indeed. Thank you!

Yes, Thanks for that!

This is very helpful. And I would like to ask, the output of the model here is size 1x10 each. Because Dense has output 1 and there are 10 samples. I would like to ask how to make this a model that predicts future horizons 1:4 for example. That I have sequence on inputs, which is here 3x10 and it would predict me 1x4, combination of input into one output into 4 horizons, on which I see a loss with true values.

Because if I lag the targeted output by 4 it will predict me only ‘horizon=4’. So I would need a model for each horizon. Since RNN/LSTM are sequences, it should be possible to use one to predict the other.

Thank you very much for any comment/code.