Following the upgrade of Flux's CUDA dependency from CUDA.jl 1.x to CUDA.jl 2.x, operations running on the GPU appear to allocate more memory and run slightly slower.
For example, on an LSTM:

```
CUDA 2.1:   8.572 s (16390042 allocations: 3.92 GiB)
CUDA 1.3.3: 8.102 s (18654053 allocations: 580.29 MiB)
```
Could it be that the memory allocations from CUDNN calls were simply not captured previously?
With a more basic Dense model, the allocation count with CUDA 2.1 is now slightly lower, but execution is clearly slower than with 1.3.3. This seems to support the idea that allocations from CUDNN calls were simply not captured before, while still pointing to a performance regression with the newest CUDA 2.1.
```
CUDA 2.1:   245.949 ms (119956 allocations: 3.53 MiB)
CUDA 1.3.3: 144.685 ms (127057 allocations: 3.74 MiB)
```
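One way to check whether the extra reported allocations actually live on the GPU is `CUDA.@time`, which (in CUDA.jl 2.x, if I understand its output correctly) reports CPU and GPU allocations separately. A minimal sketch, assuming a CUDA-capable machine:

```julia
using CUDA
using Flux

# Small Dense layer on the GPU, similar in spirit to the benchmark below
m = Dense(64, 64) |> gpu
x = CUDA.rand(Float32, 64, 1024)

# CUDA.@time prints both CPU and GPU allocation counts,
# unlike Base.@time / BenchmarkTools, which only see host allocations
CUDA.@time m(x)
```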
To reproduce, I used Flux v0.11.2 (which comes with CUDA 2.1). For the CUDA 1.3.3 comparison, I used the same Flux v0.11.2, removed CUDA and added v1.3.3 instead.
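For reference, the downgrade can be done from the Pkg API; the exact version pin here is taken from the versions quoted above:

```julia
using Pkg

# Swap the CUDA 2.1 that ships with Flux v0.11.2 for CUDA 1.3.3
Pkg.rm("CUDA")
Pkg.add(Pkg.PackageSpec(name="CUDA", version="1.3.3"))
```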
Code for the Dense example:
```julia
using Revise
using Flux
using Statistics: mean
using Random: seed!

# illustrate diverging behavior of GPU execution
seed!(123)
feat = 64
hidden = 256
batch_size = 1024

m_cpu = Chain(
    Dense(feat, hidden, relu),
    Dense(hidden, hidden, relu),
    Dense(hidden, 1))

X = rand(Float32, feat, batch_size)
Y = rand(Float32, batch_size) ./ 10

m_gpu = m_cpu |> gpu
X_gpu = gpu(X)
Y_gpu = gpu(Y)
θ_gpu = Flux.params(m_gpu)

function loss_gpu(x, y)
    l = mean((m_gpu(x) .- y) .^ 2)
    return l
end

opt_gpu = Descent(1e-3)

function speed_gpu(n=10)
    for i in 1:n
        Flux.train!(loss_gpu, θ_gpu, [(X_gpu, Y_gpu)], opt_gpu)
    end
    return loss_gpu(X_gpu, Y_gpu)
end

using BenchmarkTools
@btime speed_gpu(100)
```