Gradient on GPU 70X slower than on CPU

I just started using Flux.jl. I have a loss function whose gradient is computed much more slowly on the GPU than on the CPU. As a machine learning project typically involves a large dataset, it is difficult to create an MWE, but I wonder if someone has any suggestions based on the types of the arguments.

Here are the packages I am using:

using Flux
using Flux: onehotbatch, onecold, crossentropy, params
using CUDA

I have a loss function

loss(x, y) = crossentropy(m(x), y)

defined with some model m. On the CPU, I can evaluate the gradient of the loss function in about 8 seconds:

julia> @timev gradient(()->loss(t...), ps)
  8.022828 seconds (4.67 M allocations: 7.465 GiB, 16.80% gc time, 38.20% compilation time)
elapsed time (ns):  8022828062
gc time (ns):       1347510721
bytes allocated:    8014950633
pool allocs:        4671353
non-pool GC allocs: 512
malloc() calls:     290
free() calls:       100
minor collections:  3
full collections:   3
Grads(...)

Here the type of t is

julia> typeof(t)
Tuple{Array{Float32, 3}, OneHotArrays.OneHotMatrix{UInt32, 10, Vector{UInt32}}}

and the types of the elements of ps are a combination of 1D, 2D, and 3D arrays:

julia> typeof.(ps)
62-element Vector{DataType}:
 Array{Float32, 3}
 ⋮
 Vector{Float32} (alias for Array{Float32, 1})
 ⋮
 Matrix{Float32} (alias for Array{Float32, 2})
 ⋮

Now, to compute the gradient on the GPU, I defined a new loss function as follows:

m2 = m |> gpu
loss2(x,y) = crossentropy(m2(x), y)

I also prepared the CUDA versions of t and ps:

julia> typeof(t2)
Tuple{CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, OneHotArrays.OneHotMatrix{UInt32, 10, CuArray{UInt32, 1, CUDA.Mem.DeviceBuffer}}}

julia> typeof.(ps2)
62-element Vector{DataType}:
 CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}
 ⋮
 CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}
 ⋮
 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
 ⋮

You can see that the standard Arrays in t and ps are converted to the corresponding CuArrays in t2 and ps2.
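
For reference, a minimal sketch of how t2 and ps2 can be prepared (assuming ps2 = params(m2), which matches the types above):

t2 = t |> gpu      # Flux's gpu recursively converts the Arrays (and the OneHotMatrix storage) to CuArrays
ps2 = params(m2)   # parameters of the GPU model, all CuArrays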

However, when I compute the gradient of loss2 using t2 and ps2, it takes much longer:

julia> @timev gradient(()->loss2(t2...), ps2)
559.368659 seconds (209.64 M allocations: 31.599 GiB, 0.86% gc time, 0.26% compilation time)
elapsed time (ns):  559368659181
gc time (ns):       4785176794
bytes allocated:    33928929401
pool allocs:        209638516
non-pool GC allocs: 61
malloc() calls:     9
free() calls:       223
minor collections:  101
full collections:   0
Grads(...)

So the gradient calculation on GPU is ~70X slower than on CPU.

Any suggestions on how to tackle this issue? Even if it is not a direct solution, I would appreciate anything that helps me understand the cause.


Which GPU and CPU are you using?

What does CUDA.@time gradient(()->loss2(t2...), ps2) say? Compilation latency is higher for models on GPU, but 70x higher would be pretty extreme.

@jmair:

  • CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
  • GPU: NVIDIA Volta V100

@ToucheSir:
Here is the CUDA.@time result:

julia> CUDA.@time gradient(()->loss2(t2...), ps2)
562.157332 seconds (209.62 M CPU allocations: 31.598 GiB, 0.82% gc time) (469 GPU allocations: 7.648 GiB, 0.01% memmgmt time)
Grads(...)

The result is not much different from the @timev one.

I would like to note that the loss function itself is evaluated much faster on GPU:

julia> @timev loss(t...)
  1.066440 seconds (1.27 k allocations: 1.760 GiB, 4.48% gc time)
elapsed time (ns):  1066439502
gc time (ns):       47781243
bytes allocated:    1889943472
pool allocs:        1192
non-pool GC allocs: 2
malloc() calls:     72
free() calls:       71
minor collections:  4
full collections:   0
2.311946f0

julia> @timev loss2(t2...)
  0.016737 seconds (3.04 k allocations: 154.562 KiB)
elapsed time (ns):  16737110
gc time (ns):       0
bytes allocated:    158272
pool allocs:        3038
non-pool GC allocs: 0
minor collections:  0
full collections:   0
2.3874822f0

Also, I am using up-to-date Flux.jl and CUDA.jl on Julia 1.8:

(@v1.8) pkg> st Flux
Status `~/.julia/environments/v1.8/Project.toml`
  [587475ba] Flux v0.13.6

(@v1.8) pkg> st CUDA
Status `~/.julia/environments/v1.8/Project.toml`
  [052768ef] CUDA v3.12.0

Yeah, doesn’t look like GPU memory usage is an issue if you have access to an entire V100. Assuming ps2 = params(m), can you try timing the following:

loss2(m, x, y) = crossentropy(m(x), y)
...
gradient(loss2, m, t2...)

If possible, it would be great if you could share your model/loss definition code and a short snippet to run both on some dummy data. It won't be an MWE, but it'd be something reproducible we can play with.

Edit: I would also recommend testing in a newly created environment in case something else in your global one is causing certain dependencies of Flux to be accidentally downgraded. Then you have the option to try something like this:

# ] add GPUCompiler

using GPUCompiler
GPUCompiler.enable_timings()

# setup ...

loss2(m, x, y) = crossentropy(m(x), y)
gradient(loss2, m, t2...)

GPUCompiler.timings()

This would let us know how much time is tied up in compiling GPU kernel code.


Do you see any scalar indexing warnings? You can disallow scalar indexing with CUDA.allowscalar(false), which turns it into an error.


@avikpal, Indeed! Calculating the gradient after CUDA.allowscalar(false) generated an error:

julia> CUDA.allowscalar(false)

julia> gradient(()->loss2(t2...), ps2)
ERROR: Scalar indexing is disallowed.
...

I didn’t think this was an issue, because I thought all the variables and parameters were properly converted to CuArray. What could be the source of scalar indexing? Is there a strategy to identify it?

@ToucheSir, I tried your suggested gradient(loss2, m, t2...) after CUDA.allowscalar(false), and I got the same error. I guess once I solve the scalar indexing issue, my problem will be resolved.


If you look at the stack trace, it should tell you which function is causing scalar indexing. Most likely part of your model has not implemented the Functors interface (e.g. by using @functor) and so only some parameters have been moved to the GPU.
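
A quick sanity check for that hypothesis (a sketch, assuming ps2 = params(m2)):

filter(p -> !(p isa CuArray), collect(ps2))   # anything returned here was never moved to the GPU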

I don’t know much about @functor, but I am defining my model as

struct MyModel
    ...
end

Flux.@functor MyModel

and creating a GPU-compatible model by

m = MyModel() |> gpu

Is there anything else I need to do?

That definition looks fine, but there may be layers nested more deeply in MyModel which we’re not seeing and are problematic. That’s why I recommend looking at the stacktrace. It’ll tell you exactly where the scalar indexing happened and make it much easier to find the root cause. Feel free to post that here too.
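
For example (a hypothetical illustration, not your code), an inner custom struct without its own @functor is treated as a leaf, so gpu cannot recurse into it:

using Flux

struct Inner; W; end             # no @functor: Functors treats this as a leaf
struct Outer; inner::Inner; end
Flux.@functor Outer              # only the outer struct opts in

m = Outer(Inner(rand(Float32, 3, 3))) |> gpu
m.inner.W isa Array              # true: W was silently left on the CPU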

Thanks for being willing to look at the stack trace. Here it is:

julia> gradient(()->loss(t2...), ps2)
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] assertscalar(op::String)
    @ GPUArraysCore ~/.julia/packages/GPUArraysCore/lojQM/src/GPUArraysCore.jl:87
  [3] getindex(::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64, ::Int64)
    @ GPUArrays ~/.julia/packages/GPUArrays/fqD8z/src/host/indexing.jl:9
  [4] getindex
    @ ./permuteddimsarray.jl:71 [inlined]
  [5] _unsafe_getindex_rs
    @ ./reshapedarray.jl:250 [inlined]
  [6] _unsafe_getindex
    @ ./reshapedarray.jl:247 [inlined]
  [7] getindex
    @ ./reshapedarray.jl:235 [inlined]
  [8] _generic_matmatmul!(C::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::Base.ReshapedArray{Float32, 2, PermutedDimsArray{Float32, 3, (2, 1, 3), (2, 1, 3), CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:876
  [9] generic_matmatmul!(C::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tA::Char, tB::Char, A::Base.ReshapedArray{Float32, 2, PermutedDimsArray{Float32, 3, (2, 1, 3), (2, 1, 3), CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}, B::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:844
 [10] mul!
    @ ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:453 [inlined]
 [11] mul!
    @ ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:276 [inlined]
 [12] *
    @ ~/pkg/julia/julia-1.8/usr/share/julia/stdlib/v1.8/LinearAlgebra/src/matmul.jl:141 [inlined]
 [13] #1466
    @ ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/arraymath.jl:56 [inlined]
 [14] unthunk
    @ ~/.julia/packages/ChainRulesCore/C73ay/src/tangent_types/thunks.jl:204 [inlined]
 [15] unthunk
    @ ~/.julia/packages/ChainRulesCore/C73ay/src/tangent_types/thunks.jl:237 [inlined]
 [16] wrap_chainrules_output
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:105 [inlined]
 [17] map
    @ ./tuple.jl:223 [inlined]
 [18] wrap_chainrules_output
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:106 [inlined]
 [19] ZBack
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:206 [inlined]
 [20] Pullback
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/basic.jl:172 [inlined]
 [21] (::typeof(∂(λ)))(Δ::Base.ReshapedArray{Float32, 2, PermutedDimsArray{Float32, 3, (2, 1, 3), (2, 1, 3), CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}}, Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}, Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [22] macro expansion
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/basic.jl:53 [inlined]
 [23] Pullback
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/basic.jl:53 [inlined]
 [24] (::typeof(∂(_applychain)))(Δ::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [25] Pullback
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/basic.jl:51 [inlined]
 [26] (::typeof(∂(λ)))(Δ::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [27] Pullback
    @ ~/code/<MY_PKG>/src/models/mymodel.jl:75 [inlined]
 [28] (::typeof(∂(λ)))(Δ::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [29] Pullback
    @ ./REPL[8]:1 [inlined]
 [30] (::typeof(∂(loss)))(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [31] #208
    @ ~/.julia/packages/Zygote/dABKa/src/lib/lib.jl:206 [inlined]
 [32] #2066#back
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:67 [inlined]
 [33] Pullback
    @ ./REPL[31]:1 [inlined]
 [34] (::typeof(∂(#27)))(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [35] (::Zygote.var"#99#100"{Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, typeof(∂(#27)), Zygote.Context{true}})(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:378
 [36] gradient(f::Function, args::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:97
 [37] top-level scope
    @ REPL[31]:1
 [38] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52

Here, the line ~/code/<MY_PKG>/src/models/mymodel.jl:75 at frame [27] of the stack trace corresponds to

function (m::MyModel)(X)
...
    X = batched_mul(X, m.f(X))
...
end

where m.f = stnKD(64).

This looks like Use with multiple wrappers · Issue #21 · JuliaGPU/Adapt.jl · GitHub. The tl;dr is that one wrapper like transpose(::CuArray) is usually fine, but two or more prevent dispatch from reaching the CUDA routines, so you instead get the slow generic_matmatmul!, which works by scalar indexing.
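
A minimal sketch of the failure mode (hypothetical sizes, mirroring the wrapper types in your trace):

using CUDA
CUDA.allowscalar(false)

x = CUDA.rand(Float32, 64, 32, 16)
B = CUDA.rand(Float32, 64 * 16, 8)

# Two wrappers: a ReshapedArray of a PermutedDimsArray no longer matches the
# CUBLAS methods, so * falls back to the generic matmul, which indexes scalars:
A = reshape(PermutedDimsArray(x, (2, 1, 3)), 32, :)
A * B    # ERROR: Scalar indexing is disallowed

# Materializing the permutation gives a plain CuArray and restores fast dispatch:
A2 = reshape(permutedims(x, (2, 1, 3)), 32, :)
A2 * B   # runs on the GPU via CUBLAS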

I think this is the multiplication in ordinary Dense. I don’t see this reshape in the trace: https://github.com/FluxML/Flux.jl/blob/15c85908fdc8766c71141c5cee24726b12b0583b/src/layers/basic.jl#L175-L176 Do you know what is creating a reshaped PermutedDimsArray?

Oh, now it’s changed, with this:

What types are X and m.f(X)? batched_mul may reshape things but always from 2 to 3 dimensions, not 3 to 2.
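
(A quick way to check, as a sketch: temporarily add @show typeof(X) typeof(m.f(X)) inside the forward pass and run the model once.)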


Let me try removing the PermutedDimsArray and see if that solves the problem. Thank you!!!

UPDATE. I managed to avoid creating a PermutedDimsArray inside the model by pre-permuting the data outside the model. Now the code works on the GPU, about 10X faster than on the CPU. Maybe I can further optimize the code for the GPU, but for now this is good enough. Thank you all for the help!
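
In code, the fix amounts to something like this (a sketch with hypothetical data names and permutation order):

Xp = permutedims(X, (2, 1, 3))      # materialize the permutation once, outside the model
t2 = (Xp, y) |> gpu                 # plain Arrays become plain CuArrays, no lazy wrappers
gradient(() -> loss2(t2...), ps2)   # matmuls now dispatch to CUBLAS kernels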