Gradient on GPU 70X slower than on CPU

I just started using Flux.jl. I have a loss function whose gradient is computed much more slowly on the GPU than on the CPU. As a machine learning project typically involves a large dataset, it is difficult to create an MWE, but I wonder if someone has any suggestions based on the types of the arguments.

Here are the packages I am using:

using Flux
using Flux: onehotbatch, onecold, crossentropy, params
using CUDA

I have a loss function

loss(x, y) = crossentropy(m(x), y)

defined with some model m. On the CPU, I can evaluate the gradient of the loss function in about 8 seconds:

julia> @timev gradient(()->loss(t...), ps)
  8.022828 seconds (4.67 M allocations: 7.465 GiB, 16.80% gc time, 38.20% compilation time)
elapsed time (ns):  8022828062
gc time (ns):       1347510721
bytes allocated:    8014950633
pool allocs:        4671353
non-pool GC allocs: 512
malloc() calls:     290
free() calls:       100
minor collections:  3
full collections:   3
Grads(...)

Here the type of t is

julia> typeof(t)
Tuple{Array{Float32, 3}, OneHotArrays.OneHotMatrix{UInt32, 10, Vector{UInt32}}}

and the types of the elements of ps are a combination of 1D, 2D, and 3D arrays:

julia> typeof.(ps)
62-element Vector{DataType}:
 Array{Float32, 3}
 ⋮
 Vector{Float32} (alias for Array{Float32, 1})
 ⋮
 Matrix{Float32} (alias for Array{Float32, 2})
 ⋮

Now, to compute the gradient on the GPU, I defined a new loss function as follows:

m2 = m |> gpu
loss2(x,y) = crossentropy(m2(x), y)

I also prepared the CUDA versions of t and ps:

julia> typeof(t2)
Tuple{CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, OneHotArrays.OneHotMatrix{UInt32, 10, CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}}}

julia> typeof.(ps2)
62-element Vector{DataType}:
 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}
 ⋮
 CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}
 ⋮
 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
 ⋮

You can see that the standard Arrays in t and ps are converted to the corresponding CuArrays in t2 and ps2.
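
For reference, a minimal sketch of how t2 and ps2 can be prepared (assuming ps2 = params(m2), which matches the types above):

t2 = t |> gpu      # Flux's gpu recursively converts the Arrays (and the OneHotMatrix storage) to CuArrays
ps2 = params(m2)   # parameters of the GPU model, all CuArrays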

However, when I compute the gradient of loss2 using t2 and ps2, it takes much longer:

julia> @timev gradient(()->loss2(t2...), ps2)
559.368659 seconds (209.64 M allocations: 31.599 GiB, 0.86% gc time, 0.26% compilation time)
elapsed time (ns):  559368659181
gc time (ns):       4785176794
bytes allocated:    33928929401
pool allocs:        209638516
non-pool GC allocs: 61
malloc() calls:     9
free() calls:       223
minor collections:  101
full collections:   0
Grads(...)

So the gradient calculation on GPU is ~70X slower than on CPU.

Any suggestions on how to tackle this issue? Even if it is not a direct solution, I would appreciate anything that helps me understand the cause.


Which GPU and CPU are you using?

What does CUDA.@time gradient(()->loss2(t2...), ps2) say? Compilation latency is higher for models on GPU, but 70x higher would be pretty extreme.

@jmair:

  • CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
  • GPU: NVIDIA Volta V100

@ToucheSir:
Here is the CUDA.@time result:

julia> CUDA.@time gradient(()->loss2(t2...), ps2)
562.157332 seconds (209.62 M CPU allocations: 31.598 GiB, 0.82% gc time) (469 GPU allocations: 7.648 GiB, 0.01% memmgmt time)
Grads(...)

The result is not much different from the @timev one.

I would like to note that the loss function itself is evaluated much faster on GPU:

julia> @timev loss(t...)
  1.066440 seconds (1.27 k allocations: 1.760 GiB, 4.48% gc time)
elapsed time (ns):  1066439502
gc time (ns):       47781243
bytes allocated:    1889943472
pool allocs:        1192
non-pool GC allocs: 2
malloc() calls:     72
free() calls:       71
minor collections:  4
full collections:   0
2.311946f0

julia> @timev loss2(t2...)
  0.016737 seconds (3.04 k allocations: 154.562 KiB)
elapsed time (ns):  16737110
gc time (ns):       0
bytes allocated:    158272
pool allocs:        3038
non-pool GC allocs: 0
minor collections:  0
full collections:   0
2.3874822f0

Also, I am using up-to-date Flux.jl and CUDA.jl on Julia 1.8:

(@v1.8) pkg> st Flux
Status `~/.julia/environments/v1.8/Project.toml`
  [587475ba] Flux v0.13.6

(@v1.8) pkg> st CUDA
Status `~/.julia/environments/v1.8/Project.toml`
  [052768ef] CUDA v3.12.0

Yeah, doesn’t look like GPU memory usage is an issue if you have access to an entire V100. Assuming ps2 = params(m), can you try timing the following:

loss2(m, x, y) = crossentropy(m(x), y)
...
gradient(loss2, m, t2...)

If possible, it would be great if you could share your model/loss definition code and a short snippet to run both on some dummy data. It won't be an MWE, but it'd be something reproducible we can play with.

Edit: I would also recommend testing in a newly created environment in case something else in your global one is causing certain dependencies of Flux to be accidentally downgraded. Then you have the option to try something like this:

# ] add GPUCompiler

using GPUCompiler
GPUCompiler.enable_timings()

# setup ...

loss2(m, x, y) = crossentropy(m(x), y)
gradient(loss2, m, t2...)

GPUCompiler.timings()

This would let us know how much time is tied up in compiling GPU kernel code.


Do you see any scalar indexing warnings? You can disallow scalar indexing with CUDA.allowscalar(false), which turns it into an error.


@avikpal, Indeed! Calculating the gradient after CUDA.allowscalar(false) generated an error:

julia> CUDA.allowscalar(false)

julia> gradient(()->loss2(t2...), ps2)
ERROR: Scalar indexing is disallowed.
...

I didn’t think this was an issue, because I thought all the variables and parameters were properly converted to CuArray. What could be the source of scalar indexing? Is there a strategy to identify it?

@ToucheSir, I tried your suggested gradient(loss2, m, t2...) after CUDA.allowscalar(false), and I got the same error. I guess once I solve the scalar indexing issue, my problem will be resolved.


If you look at the stack trace, it should tell you which function is causing scalar indexing. Most likely part of your model has not implemented the Functors interface (e.g. by using @functor) and so only some parameters have been moved to the GPU.
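
A quick sanity check for that hypothesis (a sketch, assuming ps2 = params(m2)):

filter(p -> !(p isa CuArray), collect(ps2))   # anything returned here was never moved to the GPU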

I don’t know much about @functor, but I am defining my model as

struct MyModel
    ...
end

Flux.@functor MyModel

and creating a GPU-compatible model by

m = MyModel() |> gpu

Is there anything else I need to do?

That definition looks fine, but there may be layers nested more deeply in MyModel which we’re not seeing and are problematic. That’s why I recommend looking at the stacktrace. It’ll tell you exactly where the scalar indexing happened and make it much easier to find the root cause. Feel free to post that here too.
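
For example (a hypothetical illustration, not your code), an inner custom struct without its own @functor is treated as a leaf, so gpu cannot recurse into it:

using Flux

struct Inner; W; end             # no @functor: Functors treats this as a leaf
struct Outer; inner::Inner; end
Flux.@functor Outer              # only the outer struct opts in

m = Outer(Inner(rand(Float32, 3, 3))) |> gpu
m.inner.W isa Array              # true: W was silently left on the CPU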

Thanks for being willing to look at the stack trace. Here it is:

julia> gradient(()->loss(t2...), ps2)
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] assertscalar(op::String)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/lojQM/src/GPUArraysCore.jl:87
  [3] getindex(::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64, ::Int64)
    @ GPUArrays ~/.julia/packages/GPUArrays/fqD8z/src/host/indexing.jl:9
  [4] getindex
    @ ./permuteddimsarray.jl:71 [inlined]
  [5] _unsafe_getindex_rs
    @ ./reshapedarray.jl:250 [inlined]
  [6] _unsafe_getindex
    @ ./reshapedarray.jl:247 [inlined]
  [7] getindex
    @ ./reshapedarray.jl:235 [inlined]
  [8] _generic_matmatmul!(C::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::Base.ReshapedArray{Float32, 2, PermutedDimsArray{Float32, 3, (2, 1, 3), (2, 1, 3), CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:876
  [9] generic_matmatmul!(C::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::Base.ReshapedArray{Float32, 2, PermutedDimsArray{Float32, 3, (2, 1, 3), (2, 1, 3), CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:844
 [10] mul!
    @ ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:453 [inlined]
 [11] mul!
    @ ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:276 [inlined]
 [12] *
    @ ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:141 [inlined]
 [13] #1466
    @ ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/arraymath.jl:56 [inlined]
 [14] unthunk
    @ ~/.julia/packages/ChainRulesCore/C73ay/src/tangent_types/thunks.jl:204 [inlined]
 [15] unthunk
    @ ~/.julia/packages/ChainRulesCore/C73ay/src/tangent_types/thunks.jl:237 [inlined]
 [16] wrap_chainrules_output
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:105 [inlined]
 [17] map
    @ ./tuple.jl:223 [inlined]
 [18] wrap_chainrules_output
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:106 [inlined]
 [19] ZBack
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:206 [inlined]
 [20] Pullback
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/basic.jl:172 [inlined]
 [21] (::typeof(∂(λ)))(Δ::Base.ReshapedArray{Float32, 2, PermutedDimsArray{Float32, 3, (2, 1, 3), (2, 1, 3), CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [22] macro expansion
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/basic.jl:53 [inlined]
 [23] Pullback
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/basic.jl:53 [inlined]
 [24] (::typeof(∂(_applychain)))(Δ::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [25] Pullback
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/basic.jl:51 [inlined]
 [26] (::typeof(∂(λ)))(Δ::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [27] Pullback
    @ ~/code/<MY_PKG>/src/models/mymodel.jl:75 [inlined]
 [28] (::typeof(∂(λ)))(Δ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [29] Pullback
    @ ./REPL[8]:1 [inlined]
 [30] (::typeof(∂(loss)))(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [31] #208
    @ ~/.julia/packages/Zygote/dABKa/src/lib/lib.jl:206 [inlined]
 [32] #2066#back
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:67 [inlined]
 [33] Pullback
    @ ./REPL[31]:1 [inlined]
 [34] (::typeof(∂(#27)))(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [35] (::Zygote.var"#99#100"{Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, typeof(∂(#27)), Zygote.Context{true}})(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:378
 [36] gradient(f::Function, args::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:97
 [37] top-level scope
    @ REPL[31]:1
 [38] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52

Here, the line ~/code/<MY_PKG>/src/models/mymodel.jl:75 at frame [27] of the stack trace corresponds to

function (m::MyModel)(X)
...
    X = batched_mul(X, m.f(X))
...
end

where m.f = stnKD(64).

This looks like Use with multiple wrappers · Issue #21 · JuliaGPU/Adapt.jl · GitHub. The tl;dr is that one wrapper like transpose(::CuArray) is usually fine, but two or more prevent dispatch from reaching the CUDA routines, so you instead get the slow generic_matmatmul!, which works by scalar indexing.
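
A minimal sketch of the failure mode (hypothetical sizes, mirroring the wrapper types in your trace):

using CUDA
CUDA.allowscalar(false)

x = CUDA.rand(Float32, 64, 32, 16)
B = CUDA.rand(Float32, 64 * 16, 8)

# Two wrappers: a ReshapedArray of a PermutedDimsArray no longer matches the
# CUBLAS methods, so * falls back to the generic matmul, which indexes scalars:
A = reshape(PermutedDimsArray(x, (2, 1, 3)), 32, :)
A * B    # ERROR: Scalar indexing is disallowed

# Materializing the permutation gives a plain CuArray and restores fast dispatch:
A2 = reshape(permutedims(x, (2, 1, 3)), 32, :)
A2 * B   # runs on the GPU via CUBLAS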

I think this is the multiplication in ordinary Dense. I don’t see this reshape in the trace: https://github.com/FluxML/Flux.jl/blob/15c85908fdc8766c71141c5cee24726b12b0583b/src/layers/basic.jl#L175-L176 Do you know what is creating a reshaped PermutedDimsArray?

Oh, now it’s changed, with this:

What types are X and m.f(X)? batched_mul may reshape things but always from 2 to 3 dimensions, not 3 to 2.
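
(A quick way to check, as a sketch: temporarily add @show typeof(X) typeof(m.f(X)) inside the forward pass and run the model once.)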


Let me try removing the PermutedDimsArray and see if that solves the problem. Thank you!!!

UPDATE. I managed to avoid creating a PermutedDimsArray inside the model by pre-permuting the data outside the model. Now the code works on the GPU, about 10X faster than on the CPU. Maybe I can further optimize the code for the GPU, but for now this is good enough. Thank you all for the help!
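
In code, the fix amounts to something like this (a sketch with hypothetical data names and permutation order):

Xp = permutedims(X, (2, 1, 3))      # materialize the permutation once, outside the model
t2 = (Xp, y) |> gpu                 # plain Arrays become plain CuArrays, no lazy wrappers
gradient(() -> loss2(t2...), ps2)   # matmuls now dispatch to CUBLAS kernels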