Getting the most utilization out of a GPU

I am using a relatively high-spec GPU (GTX 1080 Ti) but cannot get it to run at more than 5% utilization (according to Task Manager). I believe the GPU is being used rather than the CPU, because “GPU memory” goes up by about 1 GB when I run the code below, and the datatype returned by the neural net is CuArrays.CuArray{Float32,2,Nothing}. Does anyone have an idea how to change the code or the learning algorithm to make more use of the GPU? Or is very low utilization to be expected in general (maybe my test case is too simple for GPU use)?

using Statistics
using Flux, CUDA
using Random

# Making dummy data
obs = 1000000
x = rand(Float64, 10 , obs)
y = mean(x, dims=1) + sum(x, dims=1)
y[findall(x[4,:] .< 0.3)] .= 17 # Making it slightly harder.

x = x |> gpu
y = y |> gpu

opt = Descent()
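# Build the model on the CPU, then move it to the GPU.
# Assumed completion: the remaining layers were cut off, so a final Dense(6, 1)
# is used here to match the single-row target y.
m_cpu = Chain(Dense(10, 6), Dense(6, 1))
m_gpu = m_cpu |> gpu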

dataset_gpu = Flux.Data.DataLoader(x, y, batchsize=2^12, shuffle=true) |> gpu 
loss_gpu(A, B) = Flux.mae(m_gpu(A),B)
println("Doing GPU training")
loss_gpu(x, y)
for i in 1:100 Flux.train!(loss_gpu, params(m_gpu), dataset_gpu, opt) end
loss_gpu(x, y)

I think this question is related to this one, but there did not seem to be a conclusion there.

The 1GB initial allocation is most likely caused by these lines:

x = x |> gpu
y = y |> gpu

(running a DataLoader through gpu is more or less a no-op and not necessary)

With regard to performance, I was able to get a consistent 60-80% utilization on a comparable GPU (RTX 2070). This was measured with nvidia-smi, which should be more accurate than Task Manager. If you’re able to test on a Linux machine or under WSL, that should help eliminate the OS/environment as a variable. For a more detailed picture of bottlenecks, you could also look into CUDA.jl’s profiling documentation.
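
For reference, here is a minimal sketch of polling nvidia-smi from a Julia session while training runs in another process. It assumes nvidia-smi is on the PATH (on Windows you may need the full path to the executable instead); the helper name sample_gpu, the chosen query fields, and the sample count and interval are illustrative choices, not anything prescribed by Flux or CUDA.jl.

```julia
# Standard nvidia-smi query options; the choice of fields is just for this sketch.
cmd = `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader`

# Hypothetical helper: print `n` samples, `interval` seconds apart.
# Both defaults are arbitrary.
function sample_gpu(n = 5; interval = 1)
    for _ in 1:n
        run(ignorestatus(cmd))   # do not throw if nvidia-smi exits with an error
        sleep(interval)
    end
end

# Only sample when nvidia-smi is actually on the PATH.
if Sys.which("nvidia-smi") !== nothing
    sample_gpu()
else
    println("nvidia-smi not found on PATH")
end
```

Each sample prints the current GPU utilization and the memory in use, so you can watch how the numbers change while the training loop runs.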

Thanks for this. You were completely right about Task Manager.

When I started training, I ran the following on the command line:
"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe"
I found that utilization was actually at 30%.

Then I could increase the training batchsize to 2^18 to get usage up to the 80% range.
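
One plausible reading of why the larger batch helps (an assumption on my part, not something measured here): with a network this small, each batch is very little work for the GPU, so fixed per-batch overhead dominates. The sketch below only does the arithmetic for how many batches one pass over the 1,000,000 observations takes at each batch size; batches_per_epoch is a hypothetical helper name.

```julia
obs = 1_000_000   # number of observations, as in the question

# Number of batches (and therefore gradient steps) per pass over the data.
# cld is ceiling division, so a final partial batch is counted.
batches_per_epoch(batchsize) = cld(obs, batchsize)

for bs in (2^12, 2^18)
    println("batchsize = ", bs, " -> ", batches_per_epoch(bs), " batches per epoch")
end
# batchsize = 4096 -> 245 batches per epoch
# batchsize = 262144 -> 4 batches per epoch
```

So each epoch goes from 245 small steps to 4 large ones, which gives the GPU much more work per step.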