How to arrange data for time series forecasting (mini-batching) without exceeding GPU memory for an LSTM?

Hi,

I would like to know how to arrange data for time series forecasting (mini-batching) without exceeding GPU memory for an LSTM. I tried to follow the instructions of @jeremiedb in the following link:

https://github.com/FluxML/Flux.jl/issues/1360#issuecomment-734028532

Here is my code:

# load flux module
using Flux

# gpu or cpu flag
gpu_or_cpu  = cpu

# ini variables
seq_len = 50
N = 82100
batch_size = 82050
no_features = 6
no_labels = 1

# mini-batching for time series
# X = rand(N, no_features)
# Y = rand(N, no_labels)
# X_tr, Y_tr = prepare_time_series_data_4_flux(X, Y, seq_len)
# function prepare_time_series_data_4_flux(X, Y, seq_len)
#     N = size(X, 1)
#     num_batches = Int(floor((N-seq_len)))

#     X_batched = Vector{Array{Float32}}(undef, seq_len)
#     Y_batched = Float32.(Y[seq_len+1:end,:]')
#     for i in 1:seq_len
#         X_batched[i] = hcat([Float32.(X[i+j, :]) for j in 0:num_batches-1]...)        
#     end
#     return X_batched, Y_batched
# end

# short cut instead of mini batching
X_tr = [rand(Float32, 6, batch_size) for i in 1:seq_len]
Y_tr = rand(Float32, 1,50)

# convert to cpu or gpu
X_tr =  X_tr |> gpu_or_cpu
Y_tr = Y_tr |> gpu_or_cpu
data = (X_tr, Y_tr)

# select optimizer
opt = ADAM(0.001, (0.9, 0.999))

# ini of the model
model = Chain(LSTM(6, 70), LSTM(70, 70), LSTM(70, 70), Dense(70, 1, relu)) |> gpu_or_cpu
ps = Flux.params(model)
Flux.reset!(model)

# run the model once over the sequence
y_model = model.(X_tr)

# definition of the loss function
function loss(X, Y)
    Flux.reset!(model)
    mse_val = sum(abs2.(Y .- Flux.stack(model.(X), 1)[end, 1, :]))
    return mse_val
end

# test loss
loss(data...)

# train one epoch
@time Flux.train!(loss, ps, [data], opt)

Actually, I also run out of memory when I use the CPU. Does someone know what I’m doing wrong?

I’m working on Windows 10 with an i7, 16 GB RAM and an NVIDIA Quadro P1000 with 4 GB GDDR5.

You’re feeding the model one HUGE batch of size 82050. You likely want to break your data into batches of a much smaller size; training a model on a single batch will be laborious, and splitting the data into small batches is the whole point of stochastic gradient descent.

The data argument to Flux.train! is an iterable whose length = number of batches. In the above example, [data] has length one, hence one epoch is made out of a single batch.

Consider partitioning X and Y into chunks of a more reasonable batch size such as 64 or 128 (think of how the 50K MNIST data is split into small batches).

For example, the following would represent an iterable input for training on a dataset made of 1000 batches of size 128 (that is, a total dataset comprising 128,000 observations, each with 50 steps):

seq_len = 50
batch_size = 128
num_batches = 1000
no_features = 6

X_tr = [[rand(Float32, no_features, batch_size) for i in 1:seq_len] for b in 1:num_batches]
Y_tr = [rand(Float32, batch_size, seq_len) for b in 1:num_batches]
data = zip(X_tr, Y_tr);

for d in data
  println(size(d[1]), size(d[2]))
end
(50,)(128, 50)
...
(50,)(128, 50)
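
If your raw data comes as an (N × no_features) matrix, something along these lines could slice it into that nested structure (a sketch; make_batches and its index arithmetic are mine, and it assumes each sequence is a contiguous, non-overlapping block of seq_len rows):

function make_batches(X, Y, seq_len, batch_size)
    num_batches = size(X, 1) ÷ (seq_len * batch_size)
    X_b = Vector{Vector{Matrix{Float32}}}(undef, num_batches)
    Y_b = Vector{Matrix{Float32}}(undef, num_batches)
    for k in 0:num_batches-1
        offset = k * batch_size * seq_len
        # time step t of batch k: a (no_features × batch_size) matrix
        X_b[k+1] = [Float32.(hcat((X[offset + (i-1)*seq_len + t, :] for i in 1:batch_size)...))
                    for t in 1:seq_len]
        # targets of batch k: a (batch_size × seq_len) matrix
        Y_b[k+1] = Float32.(vcat((Y[(offset + (i-1)*seq_len) .+ (1:seq_len)]' for i in 1:batch_size)...))
    end
    return X_b, Y_b
end

X_b, Y_b = make_batches(rand(82_100, 6), rand(82_100, 1), 50, 128)
data = zip(X_b, Y_b)

zip then pairs each X batch with its Y batch, so train! sees one (x, y) tuple per batch.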

Not sure if it’s the exact problem discussed here, but MLDataPattern.jl has support for lazily-evaluated time-series through slidingwindow. I have found it a bit mind-bending, but very expressive and useful. Once I’ve got the general syntax for it worked out, I wrap it in another function.

However, as far as I know, there is no current out-of-the-box support for multi-variable time series batching, or at least not in the way you would do it in other languages. I had a similar conversation on Zulip about a month ago about the eventual need for this and how to work around it.

What I’m doing currently is batching the data to the shape I want after retrieving a batch from slidingwindow. You can see how I am handling this here; just look for the places where batch_ts is defined and used. This is almost certainly creating extra allocations and reducing the benefit of MLDataPattern, but… I haven’t thought of anything else, so comments are very welcome (and any on the entire PR, too, as I hope to make it a tutorial in the model-zoo).
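
For reference, the basic labeled slidingwindow call looks roughly like this (written from memory of the MLDataPattern docs, so treat the exact signature as an assumption); turning the resulting windows into the (features × batch_size) per-step arrays Flux expects is still the manual part:

using MLDataPattern

# toy multivariate series: features in rows, time steps in columns
# (observations are taken along the last dimension by default)
X = rand(Float32, 6, 1_000)

# lazily-evaluated windows of 50 steps; the i -> i + 50 function maps the first
# index of each window to the index used as the target (here: the step right after)
ws = slidingwindow(i -> i + 50, X, 50)

x1, y1 = ws[1]    # x1: a 6×50 view into X, y1: the observation at step 51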


That was the missing information. Honestly, now I struggle a little bit with the loss. I thought I wouldn’t have to change it because it would be automatically unpacked by Flux, but it isn’t. Do I have to loop over the number of batches myself in the loss?

function loss(X,Y)
    mse_val = 0.0
    for i in 1:length(X)
        Flux.reset!(model)
        mse_val += sum(abs2.(vec(model.(X[i])[end]) .- Y[i]))
    end
    return mse_val
end

Is there an example somewhere for all of this?

For comparison, one epoch with the same data format takes approximately an hour in Keras, but with Flux it wasn’t finished after 3 h. Both were run on the CPU.

The code below works on both CPU and GPU when CUDA.allowscalar(true) is set. But I still have some problems with LSTMs in Flux.

First problem: if I set seq_len = 6000 and batch_size = 1 in the code below, the evaluation does not finish (I left it running for one night). I don’t understand why, because the amount of data inside one batch is the same as with seq_len = 600 and batch_size = 10, which takes approx. 100 s to evaluate.

Second problem: if I switch to CUDA.allowscalar(false) (as suggested in the REPL for faster evaluation), I get the following error message:

“LoadError: scalar getindex is disallowed”

I’m running the code on a Windows 10 machine with an i7 CPU, 16 GB memory and an NVIDIA Quadro P1000 GPU. My Julia version is 1.6.0, with Flux v0.12.0 and CUDA v2.6.2.

using JLD2
using Flux
using CUDA


X_train = rand(60000, 4)
Y_train = rand(60000, 1)

seq_len = 600
batch_size = 10

no_features = 4
N = size(X_train,1)
num_batches = Int(floor(N/(batch_size*seq_len)))

# mini batching
X_batched = [[hcat([Float32.(X_train[j+i*seq_len+(k*(batch_size)*seq_len), :]) for i in 0:batch_size-1]...) for j in 1:seq_len] for k in 0:num_batches-1]
Y_batched = [vcat([Float32.(Y_train[seq_len * i + 1 + (k*(batch_size)*seq_len) : (i+1) * seq_len + (k*(batch_size)*seq_len)]') for i in 0:batch_size-1]...) for k in 0:num_batches-1]

gpu_or_cpu = cpu

if gpu_or_cpu == gpu
    CUDA.allowscalar(false)
end

# convert to cpu or gpu
X_batched =  X_batched |> gpu_or_cpu
Y_batched = Y_batched |> gpu_or_cpu
data_train = (X_batched, Y_batched)

# select optimizer
opt = ADAM(0.001, (0.9, 0.999))

# definition of the loss function
function loss(X,Y)
    mse_val = 0.0
    for i in 1:length(X)
        Flux.reset!(model)
        mse_val += sum(abs2.(vcat(model.(X[i])...)'.-Y[i]))
    end
    return mse_val
end

# ini of the model
model = Chain(LSTM(4, 70), LSTM(70, 70), LSTM(70, 70), Dense(70, 1, relu)) |> gpu_or_cpu
ps = Flux.params(model)
Flux.reset!(model)

# train one epoch
@time Flux.train!(loss, ps, [data_train], opt)

Could you provide a minimal example of why you resorted to looping over length(X) in the loss function? This likely causes some issues.
I’d suggest looking at how the loss functions are defined in the latest docs; maybe the stack function is your missing ingredient. Also, I think you should use zip: data_train = zip(X_batched, Y_batched), as also shown in the docs.
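
With the layout from your code (per-step inputs of size no_features × batch_size and targets of size batch_size × seq_len), a per-batch loss could look roughly like this (a sketch, reusing your vcat formulation rather than stack):

# per-batch loss: x is a vector of seq_len matrices, y a (batch_size × seq_len) matrix
function loss(x, y)
    Flux.reset!(model)
    ŷ = vcat(model.(x)...)'    # (batch_size × seq_len): each step returns a 1 × batch_size row
    return sum(abs2, ŷ .- y)
end

data_train = zip(X_batched, Y_batched)
Flux.train!(loss, ps, data_train, opt)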

To validate that the data iterator behaves as expected, it can be useful to do something like:

for (x, y) in data_train
   println("size x: ", size(x), " | size y: ", size(y))
end

About the size of the data: seq_len and batch size won’t be interchangeable performance-wise, as much of the efficiency of larger matrix multiplications is lost with batch size = 1. Depending on your data/model setting, they are also often not interchangeable for the soundness of the model itself, since a single very long sequence means you carry the gradient calculation along that entire seq_len (which may well make the gradient calculation hard). I don’t have an extensive view on that matter, so please don’t take this for granted, but I’ve not been aware of situations where the gradient would be carried over a length much beyond ~100.


To build on this point, the AD graph for an RNN is roughly a heavily unbalanced tree. Truncated backpropagation through time can reduce the size of this tree, but it grows super-linearly with the size of the input. As such, you can see why any framework would have difficulty dealing with a size 6000 sequence, and indeed that’s part of the reason you don’t see RNNs used very often for sequence lengths of more than 1000 or so.
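
To make that concrete, here is a rough sketch (my own, not from this thread) of truncated BPTT in Flux: split the long sequence into chunks of ~100 steps and take a gradient per chunk, relying on the assumption that the hidden state carried over from the previous gradient call is treated as a constant:

using Flux

# one long sequence: 6000 steps, 6 features, batch of 1
seq_len, chunk_len = 6_000, 100
x = [rand(Float32, 6, 1) for _ in 1:seq_len]    # per-step inputs
y = rand(Float32, 1, seq_len)                   # per-step targets (batch_size × seq_len)

model = Chain(LSTM(6, 70), Dense(70, 1, relu))
ps = Flux.params(model)
opt = ADAM(0.001)

Flux.reset!(model)                              # reset once, then carry state across chunks
for chunk in Iterators.partition(1:seq_len, chunk_len)
    gs = gradient(ps) do
        ŷ = vcat(model.(x[chunk])...)'          # (1 × chunk_len) predictions for this chunk
        sum(abs2, ŷ .- y[:, chunk])
    end
    Flux.update!(opt, ps, gs)
end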


Nice, I hadn’t seen this. I changed the code according to your suggestions and now it is much faster; using the GPU even accelerates the training by a factor of approx. 20.

Furthermore, I may have discovered an error. When I simulate the trained model with a test data set longer than 10,000 data points, I get a StackOverflowError due to the Flux.stack function:

x_test = [vec(Float32.(X_test[i,:])) for i in 1:size(X_test,1)]

y_model = Flux.stack(model.(x_test),1)

it doesn’t occur with

x_test = [vec(Float32.(X_test[i,:])) for i in 1:size(X_test,1)]

y_model = vcat(model.(x_test)...)
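
Another variant that avoids splatting that many outputs (I haven’t benchmarked it) would be

y_model = reduce(vcat, model.(x_test))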

Thanks for pointing this out.
