Following the upgrade of Flux from CUDA.jl 1 to CUDA.jl 2, operations running on the GPU appear to allocate more memory and run slightly slower.
For example, on an LSTM with CUDA 2.1:
8.572 s (16390042 allocations: 3.92 GiB)
CUDA 1.3.3:
8.102 s (18654053 allocations: 580.29 MiB)
Could it be that the memory allocations from CUDNN calls were simply not captured previously?
With a more basic Dense model, the allocations with CUDA 2.1 are now slightly lower, but it is clearly slower than with 1.3.3. This supports the idea that the extra LSTM allocations come from CUDNN calls that simply weren't captured before, but it still points to a performance regression with the newest CUDA 2.1.
CUDA 2.1:
245.949 ms (119956 allocations: 3.53 MiB)
CUDA 1.3.3:
144.685 ms (127057 allocations: 3.74 MiB)
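To check whether the extra allocations are GPU-side rather than host-side, one option would be to look at CUDA.jl's memory pool statistics around a single training step. A minimal sketch, reusing the definitions from the Dense script below and assuming CUDA.memory_status() is available in the installed CUDA.jl version:
using CUDA
CUDA.memory_status()   # GPU / pool usage before one step (assumes this helper exists in this version)
@time Flux.train!(loss_gpu, θ_gpu, [(X_gpu, Y_gpu)], opt_gpu)
CUDA.memory_status()   # GPU / pool usage after one step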
To reproduce, I used Flux v0.11.2 (which comes with CUDA 2.1). For the CUDA 1.3.3 comparison, I used the same Flux v0.11.2, removed CUDA and added CUDA#v1.3.3.
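For reference, the downgrade amounted to something like this in the Pkg REPL (sketch; the prompt is shown generically):
pkg> rm CUDA
pkg> add CUDA#v1.3.3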
Code for the Dense example:
using Revise
using Flux
using Statistics: mean
using Random: seed!
# illustrate diverging behavior of GPU execution
seed!(123)
feat = 64
hidden = 256
batch_size = 1024
m_cpu = Chain(Dense(feat, hidden, relu),
              Dense(hidden, hidden, relu),
              Dense(hidden, 1))
X = rand(Float32, feat, batch_size)
Y = rand(Float32, batch_size) ./ 10
m_gpu = m_cpu |> gpu
X_gpu = gpu(X)
Y_gpu = gpu(Y)
θ_gpu = Flux.params(m_gpu)
function loss_gpu(x, y)
    l = mean((m_gpu(x) .- y).^2)
    return l
end
opt_gpu = Descent(1e-3)
# run n training steps, then evaluate the loss (brings the scalar result back to the CPU)
function speed_gpu(n=10)
    for i in 1:n
        Flux.train!(loss_gpu, θ_gpu, [(X_gpu, Y_gpu)], opt_gpu)
    end
    return loss_gpu(X_gpu, Y_gpu)
end
using BenchmarkTools
@btime speed_gpu(100)
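For completeness, the LSTM timings at the top came from a similar training-loop benchmark. Below is a minimal sketch of such a setup, reusing feat, hidden and batch_size from the script above; the sequence length, layer sizes and target shape are assumptions, not necessarily the exact model behind the numbers:
seq_len = 20   # assumed sequence length
m_rnn = Chain(LSTM(feat, hidden), Dense(hidden, 1)) |> gpu
X_seq = [gpu(rand(Float32, feat, batch_size)) for _ in 1:seq_len]   # one feature matrix per time step
Y_seq = gpu(rand(Float32, 1, batch_size) ./ 10)
θ_rnn = Flux.params(m_rnn)
opt_rnn = Descent(1e-3)

# loss on the last time step's output
loss_rnn(x, y) = mean((m_rnn.(x)[end] .- y).^2)

function speed_rnn(n=10)
    for i in 1:n
        Flux.reset!(m_rnn)   # clear the hidden state before each pass
        Flux.train!(loss_rnn, θ_rnn, [(X_seq, Y_seq)], opt_rnn)
    end
    return loss_rnn(X_seq, Y_seq)
end

@btime speed_rnn(100)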