Fast GPU Kernels differentiable with Zygote


so far I was using Tullio.jl to generate array based regularizers which are also (very) fast using Zygote.jl.

So as an example a Total Variation regularizer:

reg(x) = @tullio res = sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1]) + 1f-8)

Unfortunately Tullio is slow on GPUs (CUDA) or not even properly working without scalar indexing:

julia > reg(CuArray(img))
scalar getindex is disallowed

See also this issue.

So I was wondering if it is possible to write fast CUDA Kernels without the need to write down the gradient by hand? My search only brought me to KernelAbstractions.jl

My hope is, that Tullio.jl might improve CUDA performance but it would be nice to have something working soon.
In worst case, I probably need to write both function and derivative by hand, correct?
I would be also deeply grateful if one could provide me resources for it since I’m quite overwhelmed as a GPU rookie.



Did you do using CUDA, KernelAbstractions first? Otherwise Tullio won’t generate GPU kernels, but try to use the generic fallback instead.

1 Like

Thanks for your answer, my fault :neutral_face:

My first naive try:

julia> using CUDA, KernelAbstractions, Tullio, Zygote

julia> reg(x) = @tullio res = sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1]))
reg (generic function with 1 method)

julia> x = CUDA.rand(10, 10);

julia> reg(x)
ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float32
Closest candidates are:
  convert(::Type{T}, ::LLVM.GenericValue, ::LLVM.LLVMType) where T<:AbstractFloat at /home/fxw/.julia/packages/LLVM/7Q46C/src/execution.jl:39
  convert(::Type{T}, ::LLVM.ConstantFP) where T<:AbstractFloat at /home/fxw/.julia/packages/LLVM/7Q46C/src/core/value/constant.jl:98
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
 [1] thread_scalar
   @ ~/.julia/packages/Tullio/bgqFi/src/threads.jl:237 [inlined]
 [2] ℳ𝒶𝓀ℯ
   @ ~/.julia/packages/Tullio/bgqFi/src/macro.jl:805 [inlined]
 [3] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"}, var"#52#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#5"}})(args::CuArray{Float32, 2})
   @ Tullio ~/.julia/packages/Tullio/bgqFi/src/eval.jl:20
 [4] reg(x::CuArray{Float32, 2})
   @ Main ./REPL[1]:1
 [5] top-level scope
   @ REPL[3]:1

Second try:

reg(x) = sum(@tullio res[i, j] := sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1])))

Small benchmarks:

julia> x = randn(Float32, (10_000, 10_000));

julia> x_c = CuArray(x);

julia> @time reg(x)
  0.102237 seconds (182 allocations: 381.402 MiB)

julia> @CUDA.time reg(x_c)
  0.012611 seconds (203 CPU allocations: 6.828 KiB) (3 GPU allocations: 381.394 MiB, 2.52% gc time of which 97.30% spent allocating)

For derivatives:

julia> @time Zygote.gradient(reg, x);
  0.216524 seconds (434 allocations: 762.883 MiB)

julia> @CUDA.time Zygote.gradient(reg, x_c);
  0.030227 seconds (467 CPU allocations: 16.625 KiB) (5 GPU allocations: 1.117 GiB, 4.89% gc time of which 99.24% spent allocating)

So for this special kind it’s a factor of ~8, actually better than expected :slight_smile:
I’ve seen you contributed also to Tullio.jl. Do you have more insight in what can be expected in the future (~1 year?)?

For example, using fft as reference, I get there an impressive factor of x110 speedup.
CPU is a AMD Ryzen 5 5600x and GPU a RTX 2060 Super 8GB.

EDIT: the sum construct (second version) is ~4 times slower than the simple Tullio expression. How to fix that actually?

julia> reg(x) = @tullio res = sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +abs2(x[i, j, k] - x[i, j+1, k]) + abs2(x[i, j, k] - x[i, j, k+1]))
reg (generic function with 1 method)

julia> reg2(x) = sum(@tullio res[i, j, k] := sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +abs2(x[i, j, k] - x[i, j+1, k]) + abs2(x[i, j, k] - x[i, j, k+1])))
reg2 (generic function with 1 method)

julia> @time reg(x)
  0.010721 seconds (225 allocations: 10.500 KiB)

julia> @time reg2(x)
  0.037554 seconds (196 allocations: 126.513 MiB)

Tullio’s GPU support is a bit experimental, and pretty naiive, for now it tries to write the simplest KernelAbstractions code. I think this works best on things close to broadcasting.

I think scalar reductions won’t work at all, right now, although they could have a friendlier error. They need another plan for “threading”, as there is no index it’s safe to divide the output along. If someone knows a good way to write (say) this example in KA, it might not be so hard to generalise, please make an issue.

This example might work better like this, allocating a smaller temporary res but I haven’t tried it:

reg3(x) = sum(@tullio res[j] := sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1])))

1 Like

If you have simple functions you wish to be able to be executed both on CPU (including multi-threading) and GPUs, you could have a look at ParallelStencil.jl.

1 Like

That looks interesting!

What about automatic differentiation? Is it fast?

There is no automatic differentiation support right now in there, but you may be able to combine it in a workflow that would do it I suppose. If using CPU, ParallelStencil uses regular Julia arrays, and on GPU, it switches to CuArray.

Strange, reg3 works for both CPU and GPU. For CPUs it is as fast as reg. For GPUs, it is even slower than reg2.

I’m currently trying a few combinations. Some results can be found in this jupyter notebook.