When training a recurrent network using Flux.jl, a dataset with K features, S samples, and a sequence length of L should take the form of a vector of length L where each element is a K × S matrix.
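To make that layout concrete, here is a toy example (the sizes K = 2, S = 3, L = 4 and the hidden width of 5 are made up, and I’m assuming Flux’s usual stateful API where a recurrent layer consumes one K × S timestep at a time):
using Flux
K, S, L = 2, 3, 4                          # features, samples, sequence length (toy sizes)
data = [rand(Float32, K, S) for _ ∈ 1:L]   # vector of L matrices, each K × S
m = RNN(K => 5)                            # recurrent layer with 5 hidden units
Flux.reset!(m)                             # clear the hidden state
ys = [m(x) for x ∈ data]                   # each output is a 5 × S matrix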
Let’s say that I have some univariate time series (K = 1) with L = 100 and S = 10_000. If my data is in tabular format, where the rows represent the timesteps and each column holds the realizations of one sample, I have an L × S matrix.
A quick and easy way to transform this L × S matrix into the format required for recurrence in Flux is the following function:
tabular2rnn(X) = [X[i:i, :] for i ∈ 1:size(X, 1)]
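For example, on the tabular data described above (the size checks below are mine, just to illustrate the shapes):
X = randn(Float32, 100, 10_000)   # L × S tabular data
seq = tabular2rnn(X)              # vector of L row slices
length(seq)                       # 100
size(seq[1])                      # (1, 10_000), i.e. K × S with K = 1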
This works well enough; however, for some reason this function seems to be incredibly slow on the GPU. Consider the following code:
using BenchmarkTools
using CUDA   # may be required for gpu(...) to actually move data, depending on the Flux version
using Flux

tabular2rnn(X) = [X[i:i, :] for i ∈ 1:size(X, 1)]

function batching(X, Y, batchsize)
    for idx ∈ Iterators.partition(1:size(X, 2), batchsize)
        # slice out one minibatch and convert it to the RNN format
        Xb, Yb = X[:, idx], Y[:, idx]
        tabular2rnn(Xb)
    end
    nothing
end

X_cpu, Y_cpu = randn(Float32, 100, 10_000), randn(Float32, 100, 10_000)
X_gpu, Y_gpu = gpu(X_cpu), gpu(Y_cpu)

@benchmark batching(X_cpu, Y_cpu, 32)
@benchmark batching(X_gpu, Y_gpu, 32)
On the CPU this runs in approximately 3.8 ms, while on the GPU it takes 192 ms. How can I go about speeding up this code on the GPU? Any suggestions are much appreciated, thanks!
I also tried to work with batchseq from the MLUtils.jl package but wasn’t successful in improving the runtime.
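For reference, this is roughly what that attempt looked like (a sketch from memory; the way I build seqs here, one sequence of length-1 feature vectors per sample, is my own reconstruction and may not be the idiomatic use of batchseq):
using MLUtils
# one sequence per sample: a vector of L one-element feature vectors
seqs = [[X_cpu[t:t, s] for t ∈ 1:size(X_cpu, 1)] for s ∈ 1:size(X_cpu, 2)]
# batchseq turns these S sequences of length L into a length-L vector of K × S batches,
# padding shorter sequences with the given value (not needed here, all lengths are equal)
batched = batchseq(seqs, 0f0)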