GPU performance and switching from tabular to recurrent data format for Flux.jl

When training a recurrent network using Flux.jl, a dataset with K features, S samples, and a sequence length of L should take the form of a vector of length L where each element is a K × S matrix.
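For example, with K = 2 features, S = 3 samples, and L = 4 timesteps (toy sizes just to illustrate the shape), the data would look like this:

seq = [rand(Float32, 2, 3) for _ ∈ 1:4]  # vector of L = 4 elements, each a K × S matrix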

Let’s say that I have some univariate time series (K=1) with L=100 and S=10_000.

If my data is in tabular format, where the rows represent the timesteps and each column contains the realizations of one sample, I have an L × S matrix.

A quick and easy way to transform this L × S matrix to the necessary format for recurrence in Flux is to use the following function:

tabular2rnn(X) = [X[i:i, :] for i ∈ 1:size(X, 1)]
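
For the dimensions above, this produces a vector of 100 matrices, each of size 1 × 10 000:

seq = tabular2rnn(randn(Float32, 100, 10_000))
length(seq)       # 100, one element per timestep
size(first(seq))  # (1, 10000), i.e. K × S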

This works well enough; however, for some reason this function seems to be incredibly slow on the GPU. Consider the following code:

using BenchmarkTools
using Flux

tabular2rnn(X) = [X[i:i, :] for i ∈ 1:size(X, 1)]

function batching(X, Y, batchsize)
    # Split the samples (columns) into minibatches and convert each X batch
    # to the sequence format; Yb is sliced but otherwise unused here.
    for idx ∈ Iterators.partition(1:size(X, 2), batchsize)
        Xb, Yb = X[:, idx], Y[:, idx]
        tabular2rnn(Xb)
    end
    nothing
end

X_cpu, Y_cpu = randn(Float32, 100, 10_000), randn(Float32, 100, 10_000)
X_gpu, Y_gpu = gpu(X_cpu), gpu(Y_cpu)

@benchmark batching(X_cpu, Y_cpu, 32)

@benchmark batching(X_gpu, Y_gpu, 32) 

The code on the CPU runs in approximately 3.8 ms, while the one on the GPU runs in 192 ms. How can I go about speeding up this code for the GPU? Any suggestions are much appreciated, thanks!

I also tried to work with batchseq from the MLUtils.jl package but wasn’t successful in improving the runtime.

If dim 1 is large, repeatedly slicing X into a bunch of tiny copies is going to cause some overhead. eachslice(X, dims=1) is likely better since it tries to make views instead. Flux uses its own internal function to be more AD-friendly during evaluation (see Flux.jl/recurrent.jl at master · FluxML/Flux.jl · GitHub), so if that’s something you need, it may be helpful inspiration.
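
A quick way to see that the slices are views rather than copies (with X being your L × S matrix):

first(eachslice(X, dims=1)) isa SubArray  # true: each slice is a view into X, no copy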

Thanks!

Using

tabular2rnn(X) = [permutedims(x) for x ∈ eachslice(X, dims=1)]

instead of what I was using reduces the CPU runtime to roughly 2.5 ms and the GPU runtime to 3.4 ms. Quite the improvement. If I find some inspiration on how to improve it even more using your links, I’ll post an update here; until then, this is a good solution!

Welp, now I’m having different issues with this solution. Consider the following code:

tabular2rnn(X) = [permutedims(x) for x ∈ eachslice(X, dims=1)]

m = Chain(
    LSTM(1, 10),
    Dense(10, 1)
) |> gpu


X1 = [gpu(randn(1, 100)) for _ ∈ 1:100]
X2 = tabular2rnn(gpu(randn(100, 100)))

[m(x) for x ∈ X1] # Runs fine
[m(x) for x ∈ X2] # Scalar indexing on GPU array

I’m not sure why the second version is doing scalar indexing on a GPU array but not the first. Is this due to the view?

Have you tried without the permutedims? I think that may be messing things up.
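
That is, something like this (just dropping the permutedims and keeping each slice as-is):

tabular2rnn(X) = [x for x ∈ eachslice(X, dims=1)]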

Indeed, that does work without doing scalar indexing on the GPU array…

However, the problem I have in this case is that eachslice(...) returns a plain vector for each slice, while I need each timestep to be a 1 × sample_size matrix, since the single feature should sit along the first dimension. I also tried doing reshape(x, 1, :), but the scalar indexing still happens.
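
For reference, the reshape attempt was roughly this (it still triggers the scalar indexing warning on the GPU):

tabular2rnn(X) = [reshape(x, 1, :) for x ∈ eachslice(X, dims=1)]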

EDIT

Okay, looking at the source code for eachslice, here’s a solution that works without doing scalar indexing and still performs decently for the GPU (~2.2 ms on CPU and ~3 ms on GPU according to the above benchmark).

tabular2rnn(X) = [view(X, i:i, :) for i ∈ 1:size(X, 1)]
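
With this version, the earlier example runs without the warning (same model m as above; I’m using Float32 inputs here to match the model’s parameters):

X2 = tabular2rnn(gpu(randn(Float32, 100, 100)))
[m(x) for x ∈ X2]  # runs on the GPU without scalar indexing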

Thanks for all the help.
