Simple Flux LSTM for Time Series

I’ll be happy to make a PR once I have it a bit more polished.

An updated file is here: https://github.com/mcreel/model-zoo/tree/master/contrib/timeseries/AR1_lstm This now does train/test split and stops training when there is no improvement.

I will probably improve the README, then make a PR. This works very well; a comparison of the neural net forecast with the maximum likelihood forecast is in the image.

[Image: neural net forecast compared with the maximum likelihood forecast]

Yes, thanks so much for that - it is a great help.

I have one question (which may be off-base). In sequences of LSTM forecasts, it seems that one would not want to include the early forecasts in the sequence after reset within the loss (or perhaps, maybe one might only be interested in the final forecast?)
Does this concern make sense? And if so, how might one incorporate this?
Thanks again.

Very nice, but the link is not working…

Perhaps you mean: https://github.com/mcreel/model-zoo/blob/master/contrib/timeseries/AR1_LSTM.jl

Yes, thanks. I moved it to what seems like a better organization, and I made a PR.

That’s a good question. The way I wrote the loss function, the first element in a batch is fit using only 1 lag, the second uses two lags, and so forth, up to the last element, which uses the full set of lags. However, every observation in the sample will appear as the last element in a batch, so the full conditioning information will be used. It’s just that partial conditioning information is also mixed in. How to do this optimally is not clear to me. There is certainly a lot of room for experimenting with loss functions and configurations of models.

I think that the loss you propose can easily be defined by looking at only the last element of the batch:

function loss(x,y)
    Flux.reset!(m)
    Flux.mse(m(x)[end],y[end])
end

Working with this, I get forecasts that look pretty much the same as with the original version, but I haven’t done any formal measurements as to which might perform better.

Great. Thanks! I had tried something like this but got mixed up with dimensions.

In this case, am I right that one only needs the last element of the batch?
And couldn’t one then make up batches of these single sequences?
(That is what I was trying to do, anyway)

I suppose that there may be other ways of doing the same thing, but I find the whole procedure of how to make batches for recursive models to be pretty confusing. I tried this myself several times over the last couple of years. This is the first time I was able to get the model to actually predict well, and I was very pleased when I saw that happen.

I’d be very interested in seeing alternative approaches that make better use of the available tools.

Unfortunately, the solution I proposed isn’t really making proper use of the LSTM first layer, as it is not being provided with a sequence of data in each batch. Each batch is only one vector. So, there is no use of the state of the LSTM, and it’s acting like a dense layer. In fact, if you just replace it with a dense layer with nonlinear activation, it still works as well (which is not surprising, as it’s just a straightforward regression problem). So, this needs more work, to make each batch provide a number of sequences, each of which contains a number of lags of the dependent variable. Back to the drawing board, starting with reading the docs more carefully.

Thank you for the provided code!

I have made slight adjustments such that the recurrence property of the LSTM layer should be correctly used. However, this now needs many more training epochs. Here’s the gist.

Also, as a side note, I believe that on line 77 you should use bestmodel = deepcopy(m). Otherwise bestmodel simply points to m and will be updated even when m performs worse.
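The aliasing issue can be seen without Flux at all. Below is a minimal sketch in which a NamedTuple of arrays stands in for the model (the names `alias` and `snapshot` and the toy parameters are made up for illustration); it only shows how plain assignment and deepcopy differ in Julia, not how the original script is organized.

```julia
# Toy stand-in for a model: a NamedTuple holding mutable parameter arrays.
# (Hypothetical example; a Flux model behaves the same way under assignment.)
m = (W = [1.0 2.0], b = [0.0])

alias = m               # a second name for the same object
snapshot = deepcopy(m)  # an independent copy of the current parameters

m.W[1] = 99.0           # simulate a later training update that changes m in place

println(alias.W[1])     # follows the update to m
println(snapshot.W[1])  # still holds the value from the time of the copy
```

So if the intent is to keep the best parameters seen so far, the copy has to be taken at the moment the model improves.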

I’m not convinced what I did is very good but perhaps it can help you perfect your example. I will definitely try my hand at writing other examples of recurrent network tutorials for model-zoo in the coming days/weeks.

Thanks very much! From a quick read, I think that in your code the loss is computed by running through the full sample, using the difference between y(t) and the prediction from the model using x(t)=y(t-1) and the state. I think that a potentially useful next step might be to break this into batches of y(t-p)…y(t-1) plus the state affecting the prediction of y(t), and resetting the state in the loss function between every new prediction. I will be working on that idea, and I’m hopeful that learning will be better. I’ll post updates here when I have something useful.
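As a rough sketch of the data layout this idea implies, the helper below cuts a series into windows of p lags, each paired with the value to predict. The function name `make_windows` and the argument `p` are illustrative and not part of Flux; feeding the windows to an LSTM and resetting the state between them is left out.

```julia
# Build (lags, target) pairs from a univariate series.
# `make_windows` is a hypothetical helper, not a Flux function.
function make_windows(y::AbstractVector, p::Integer)
    T = length(y)
    T > p || throw(ArgumentError("series must be longer than p"))
    X = [y[t-p:t-1] for t in p+1:T]   # each entry holds y(t-p), ..., y(t-1)
    target = [y[t] for t in p+1:T]    # the value to predict, y(t)
    return X, target
end

y = collect(1.0:10.0)
X, target = make_windows(y, 3)

println(X[1], " -> ", target[1])      # first window and its target
println(X[end], " -> ", target[end])  # last window and its target
```

Each window would then be run through the model one lag at a time after a state reset, with the loss taken on the final output only.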

You are correct. While I’m at it, if you have any resources on the subject, I’d be happy to read through them. I have a tough time wrapping my head around how long my sequence lengths should be and how big my batches should be. If it’s not implicitly given by the problem, I feel like it is a somewhat arbitrary choice.

For instance, suppose we have some volatility series over many years. One could compute the loss on the full series or break it into monthly or yearly batches. I’m still trying to understand what makes the most sense and why.

Finally, as I have given myself headaches with the recurrent data format in Flux every single time I tried my hand at some RNN project, I have at some point created the following helpers. Perhaps they can also be of some use to you:

"""
    tabular2rnn(X)
Converts tabular data `X` into an RNN sequence format.
`X` should have format T × K × M, where T is the number of time steps, K is the number
of features, and M is the batch size (the number of sequences processed together).
The result is a vector of T matrices, each of size K × M.
"""
tabular2rnn(X::AbstractArray{Float32, 3}) = [X[t, :, :] for t ∈ 1:size(X, 1)]

"""
    rnn2tabular(X)
Converts RNN sequence format `X` into tabular data.
"""
rnn2tabular(X::Vector{Matrix{Float32}}) = permutedims(cat(X..., dims=3), [3, 1, 2])
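A quick round trip is a reasonable sanity check for these two helpers. The sketch below repeats their definitions so that it runs on its own; the sizes T, K and M are arbitrary values chosen for illustration.

```julia
# Definitions repeated from the post so the example is self-contained.
tabular2rnn(X::AbstractArray{Float32, 3}) = [X[t, :, :] for t ∈ 1:size(X, 1)]
rnn2tabular(X::Vector{Matrix{Float32}}) = permutedims(cat(X..., dims=3), [3, 1, 2])

T, K, M = 5, 2, 3                 # time steps, features, third dimension (arbitrary)
X = rand(Float32, T, K, M)

seq = tabular2rnn(X)              # vector with one K × M matrix per time step
println(length(seq), " matrices of size ", size(seq[1]))

Xback = rnn2tabular(seq)          # back to T × K × M
println(Xback == X)
```

The intermediate `cat(X..., dims=3)` is K × M × T, which is why the permutation `[3, 1, 2]` is needed to bring time back to the first dimension.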

Thanks! About batching: in general, the number of lags to use in a batch should be related to the degree of dependency in the data. With highly temporally dependent data, longer batches will be good, because data at longer lags still has significant correlation with the current observation.

The AR1 model will have higher correlations between lags when rho is close to 1, so that would suggest using more lags per batch. However, the AR1 model is a special case: conditioning on y(t-1), we capture all information, so for this model one lag is in fact enough. For more complicated models, it may not be possible to come up with a result like this, which would suggest following the rule of using more lags when the correlations die off slowly.
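To spell out both statements, here is a short derivation under the usual assumptions, which are not stated in the thread: a zero-mean stationary AR1 with |ρ| < 1 and i.i.d. innovations ε_t that are independent of the past.

```latex
\begin{align*}
  y_t &= \rho\, y_{t-1} + \varepsilon_t,
      \qquad |\rho| < 1,\quad \varepsilon_t \sim \text{i.i.d.}(0, \sigma^2) \\
  \operatorname{Corr}(y_t, y_{t-k}) &= \rho^{k}, \qquad k = 0, 1, 2, \dots \\
  \mathbb{E}\left[\, y_t \mid y_{t-1}, y_{t-2}, \dots \right]
      &= \rho\, y_{t-1} + \mathbb{E}\left[\, \varepsilon_t \mid y_{t-1}, y_{t-2}, \dots \right]
       = \rho\, y_{t-1}
\end{align*}
```

The second line shows why correlations die off slowly when ρ is near 1, and the third shows why further lags add nothing to the forecast once y(t-1) is known.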

I think that we should put together some resources and eventually make a PR for the model zoo. I will be offline for the next few days, but I’m going to keep on with this when I get back.

Thanks, that is getting a bit clearer, but in Flux I am used to inputting X’s and y’s together in tuples. Do you have any comment about how to build the dataset for inputting both X and y together (like Flux seems to expect)?

Thanks again

For state space models, like Kalman filters (and LSTMs), there exists the notion of ‘steady state’ operation, which means that the length of the series prior to the forecast does not have to be fixed and definite, but just sufficiently long for the system to have achieved steady-state operation (this would not be true for RNNs in general).

My inclination is to go back to the beginning of whatever sequence is being studied, for every element of every batch, but to include only the error of the final (or last fixed number of) forecast(s), which is what I have been trying to do.

I think I may have figured out how to do what I want. I think batching doesn’t work here, and I need to think of the entire sequence as 1 observation and 1 epoch.

datamat=[(data[1:end-1]',data[2:end]')];

function loss(x,y)
    Flux.reset!(m)
    Flux.mse(m(x)[:,100:end],y[:,100:end])  # exclude the first 99 elements of the sequence from the error
end

m = Chain(LSTM(1, 10), Dense(10,2, tanh), Dense(2,1))

function cb()
    println(loss(datamat[1][1],datamat[1][2]))
    flush(stdout)
end

epochs=100
Flux.@epochs epochs Flux.train!(loss,Flux.params(m), datamat, ADAM(),cb=Flux.throttle(cb,1000))

This trains very well and seems to do what I want it to do.
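To make the effect of the column slicing explicit, here is a small sketch of the same idea on plain arrays, without Flux. The function name `burnin_mse` and the argument `burnin` are illustrative only. Note that a slice written as `100:end` starts at element 100, so a burn-in of n elements corresponds to `n+1:end`.

```julia
using Statistics: mean

# Mean squared error over columns burnin+1:end only.
# `burnin_mse` is a hypothetical helper, not a Flux function.
function burnin_mse(ŷ::AbstractMatrix, y::AbstractMatrix, burnin::Integer)
    size(ŷ) == size(y) || throw(DimensionMismatch("ŷ and y must have the same size"))
    cols = burnin+1:size(y, 2)
    return mean(abs2, ŷ[:, cols] .- y[:, cols])
end

y = zeros(1, 6)
ŷ = [5.0 5.0 1.0 1.0 1.0 1.0]   # large errors only in the first two positions

println(burnin_mse(ŷ, y, 2))    # the first two columns are ignored
println(burnin_mse(ŷ, y, 0))    # every column counts
```

The early forecasts still pass through the recurrent layer and update its state; they are just left out of the error.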

I wanted to get back to writing blog posts and figured this might be a good start, so I wrote up a few sentences this afternoon. Perhaps this can be of use to you: A Simple Recurrent Model in Flux | Jonathan Chassot

If self-promotion is not tolerated I’ll make sure to remove it, but I think it explains how to work out the case with both X and y. In general, you just want to reshape your X and not necessarily your y.

To summarize, I just keep both my X and y separated and I don’t use the Flux.train!() function but rather compute the gradients and use Flux.update!(). This is something that was suggested to me by someone more knowledgeable about RNNs.

Thanks for that. I will play around with that.

Incidentally, when I try to use the syntax

LSTM(in::Integer, out::Integer, σ = tanh)

as in the model reference, I get an error:

MethodError: no method matching Flux.LSTMCell(::Int64, ::Int64, ::typeof(tanh))
Closest candidates are:
  Flux.LSTMCell(::A, ::A, ::V, ::S) where {A, V, S} at ~/.julia/packages/Flux/qAdFM/src/layers/recurrent.jl:208
  Flux.LSTMCell(::Integer, ::Integer; init, initb, init_state) at ~/.julia/packages/Flux/qAdFM/src/layers/recurrent.jl:214

which is the same for any activation function I try. Replacing LSTM by RNN above works fine.

Do you happen to know why the syntax given in the model reference does not work?

With LSTM or GRU you don’t specify the activation function; that is only possible for RNN layers. The correct way would be

LSTM(in::Integer, out::Integer)

Hi thanks. The model reference is this:

LSTM(in::Integer, out::Integer, σ = tanh)

[I just want to avoid things blowing up on recursion]