Flux on GPU too slow

Hi, I am training a number of NNs with Flux and CUDA. However, training is not accelerated on the GPU as much as I expected. I used the CuIterator as described in the docs for GPU training. Any ideas on how I can accelerate the training procedure? My code is mainly the following.

using Flux
using CUDA


CUDA.allowscalar(false)


X = rand(6, 20000)
Y = rand(1, 20000)

data = Flux.DataLoader((X,Y), batchsize = 64)

modelgpu = Chain(Dense(6,256, elu), Dense(256,256, elu), Dense(256,256, elu), Dense(256,1)) |>gpu
modelcpu = cpu(modelgpu)

function training(model, train::Flux.DataLoader; 
    epochs::Int = 100, opt = Flux.Adam(1e-3), loss_fun = Flux.Losses.mse, 
    GPU::Bool)
    par_model   = Flux.params(model)
    for ep=1:epochs
        if GPU
            for (x, y) in CuIterator(train)
                ∇ = gradient(par_model) do
                    loss_fun(model(x),y)
                end
                Flux.Optimise.update!(opt, par_model, ∇)
            end
        else
            for (x, y) in train
                ∇ = gradient(par_model) do
                    loss_fun(model(x),y)
                end
                Flux.Optimise.update!(opt, par_model, ∇)
            end
        end
    end
end


@time training(modelcpu, data; 
    epochs = 500, opt = Flux.Adam(1e-3), loss_fun = Flux.Losses.mse, 
    GPU=false) 

@time training(modelgpu, data; 
    epochs = 500, opt = Flux.Adam(1e-3), loss_fun = Flux.Losses.mse, 
    GPU=false) 

The result is

GPU:

397.086585 seconds (542.80 M allocations: 35.734 GiB, 3.18% gc time, 3.75% compilation time)

CPU:

496.777942 seconds (22.23 M allocations: 834.203 GiB, 4.65% gc time)

Version Info

Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

CUDA info

CUDA toolkit 11.7, artifact installation
NVIDIA driver 515.65.1, for CUDA 11.7
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+515.65.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.1
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce GTX 1050 (sm_61, 3.863 GiB / 4.000 GiB available)

Isn’t the GPU=false argument in your second call the problem? With it, the data are not on the GPU.

I copy-pasted it wrong. It was ‘GPU=true’ when I ran it. Furthermore, I managed to bring it down to 250 s by using ‘@epochs’ and ‘Flux.train!’, but I still think it is too slow.

Maybe try a larger batch size? Like 20000.
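
To see why the batch size matters here, it can help to count optimizer steps. The sketch below uses only base Julia; `updates_per_epoch` is a hypothetical helper name, not part of Flux, and it assumes the loader keeps the last partial batch. Each step launches several small GPU kernels, so at batch size 64 the fixed per-step overhead is plausibly what dominates the run time.

```julia
# Hypothetical helper (not a Flux function): number of mini-batches, and
# therefore optimizer steps, per epoch when the last partial batch is kept.
updates_per_epoch(nsamples, batchsize) = cld(nsamples, batchsize)

nsamples = 20_000   # number of samples in the original post
epochs = 500        # number of epochs in the original post

for batchsize in (64, 1024, 20_000)
    steps = updates_per_epoch(nsamples, batchsize)
    println("batchsize = ", batchsize, ": ", steps, " steps per epoch, ",
            steps * epochs, " steps in total")
end
```

Note that a much larger batch changes the optimization itself (far fewer parameter updates per epoch), so the learning rate or the number of epochs may need retuning.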

rand(6, 20000) generates arrays of Float64. Your GPU (because of market segmentation, among other things) only runs Float64 ops at a fraction of the speed of Float32. Try changing the rand(...) to rand(Float32, ...) and see how that behaves.
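
The Float32 suggestion can be sketched in base Julia as follows. The variable names `X64` and `X32` are illustrative only; the array shapes are the ones from the original post.

```julia
# Without an element type, rand produces Float64 values.
X64 = rand(6, 20000)

# Generate Float32 data directly ...
X = rand(Float32, 6, 20000)
Y = rand(Float32, 1, 20000)

# ... or convert existing Float64 data by broadcasting the constructor.
X32 = Float32.(X64)

println("element types: ", eltype(X64), ", ", eltype(X), ", ", eltype(X32))
println("bytes: ", sizeof(X64), " (Float64) vs ", sizeof(X32), " (Float32)")
```

Besides the faster arithmetic, Float32 data takes half the memory, so each host-to-device copy is also smaller.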


OK, I ran my model as you suggested, and I also ran the same model in Keras for 100 epochs. I ran both on Colab. The results were: 141.95 s for Keras and 43.67 s for Flux, both on GPU.

So, I guess that at this batch size no further improvement can be achieved, as @skleinbo suggested.

Python script

import numpy as np
import time
import tensorflow as tf  # needed for tf.random.uniform below
from tensorflow.keras import layers
import tensorflow.keras

X = tf.random.uniform((20000,6))
Y = tf.random.uniform((20000,1))

model = tensorflow.keras.Sequential([
    layers.Dense(256, input_dim=6, activation='elu'),
    layers.Dense(256, activation='elu'),
    layers.Dense(256, activation='elu'),
    layers.Dense(1)
    ])

model.compile(loss='mse', optimizer='adam')

start_time = time.time()
model.fit(X, Y, verbose=1, epochs=100, batch_size = 64)
end_time = time.time() -start_time
print(end_time)

I’m surprised the Keras times aren’t better! There’s definitely some fixed compilation overhead on the Flux side still, but I’m glad we’re competitive for this example.
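
One way to separate the compilation overhead mentioned above from the actual run time is to time the same call twice. This is a minimal sketch in base Julia; `work` is a hypothetical stand-in for the training call, not code from this thread.

```julia
# Hypothetical stand-in workload; in the thread this would be the training call.
work(x) = sum(abs2, x)

x = rand(Float32, 1000)

t_first = @elapsed work(x)    # first call: includes JIT compilation of `work`
t_second = @elapsed work(x)   # second call: already compiled, run time only

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

For a comparison like the one in this thread, a short warm-up run (for example, one epoch) before the timed run would keep the one-off compilation cost out of the measurement.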