Fast GPU Kernels differentiable with Zygote

roflmaostc · March 8, 2021, 6:32pm

Hey,

so far I was using Tullio.jl to generate array based regularizers which are also (very) fast using Zygote.jl.

So as an example a Total Variation regularizer:

reg(x) = @tullio res = sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1]) + 1f-8)

Unfortunately Tullio is slow on GPUs (CUDA) or not even properly working without scalar indexing:

julia > reg(CuArray(img))
scalar getindex is disallowed

See also this issue.

So I was wondering if it is possible to write fast CUDA Kernels without the need to write down the gradient by hand? My search only brought me to KernelAbstractions.jl

My hope is, that Tullio.jl might improve CUDA performance but it would be nice to have something working soon.
In worst case, I probably need to write both function and derivative by hand, correct?
I would be also deeply grateful if one could provide me resources for it since I’m quite overwhelmed as a GPU rookie.

Thanks,

Felix

simeonschaub · March 8, 2021, 7:02pm

Did you do using CUDA, KernelAbstractions first? Otherwise Tullio won’t generate GPU kernels, but try to use the generic fallback instead.

roflmaostc · March 8, 2021, 7:56pm

Thanks for your answer, my fault

My first naive try:

julia> using CUDA, KernelAbstractions, Tullio, Zygote

julia> reg(x) = @tullio res = sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1]))
reg (generic function with 1 method)

julia> x = CUDA.rand(10, 10);

julia> reg(x)
ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float32
Closest candidates are:
  convert(::Type{T}, ::LLVM.GenericValue, ::LLVM.LLVMType) where T<:AbstractFloat at /home/fxw/.julia/packages/LLVM/7Q46C/src/execution.jl:39
  convert(::Type{T}, ::LLVM.ConstantFP) where T<:AbstractFloat at /home/fxw/.julia/packages/LLVM/7Q46C/src/core/value/constant.jl:98
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  ...
Stacktrace:
 [1] thread_scalar
   @ ~/.julia/packages/Tullio/bgqFi/src/threads.jl:237 [inlined]
 [2] ℳ𝒶𝓀ℯ
   @ ~/.julia/packages/Tullio/bgqFi/src/macro.jl:805 [inlined]
 [3] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"}, var"#52#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#5"}})(args::CuArray{Float32, 2})
   @ Tullio ~/.julia/packages/Tullio/bgqFi/src/eval.jl:20
 [4] reg(x::CuArray{Float32, 2})
   @ Main ./REPL[1]:1
 [5] top-level scope
   @ REPL[3]:1

Second try:

reg(x) = sum(@tullio res[i, j] := sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1])))

Small benchmarks:

julia> x = randn(Float32, (10_000, 10_000));

julia> x_c = CuArray(x);

julia> @time reg(x)
  0.102237 seconds (182 allocations: 381.402 MiB)
1.7428845f8


julia> @CUDA.time reg(x_c)
  0.012611 seconds (203 CPU allocations: 6.828 KiB) (3 GPU allocations: 381.394 MiB, 2.52% gc time of which 97.30% spent allocating)
1.7428845f8

For derivatives:

julia> @time Zygote.gradient(reg, x);
  0.216524 seconds (434 allocations: 762.883 MiB)

julia> @CUDA.time Zygote.gradient(reg, x_c);
  0.030227 seconds (467 CPU allocations: 16.625 KiB) (5 GPU allocations: 1.117 GiB, 4.89% gc time of which 99.24% spent allocating)

So for this special kind it’s a factor of ~8, actually better than expected
I’ve seen you contributed also to Tullio.jl. Do you have more insight in what can be expected in the future (~1 year?)?

For example, using fft as reference, I get there an impressive factor of x110 speedup.
CPU is a AMD Ryzen 5 5600x and GPU a RTX 2060 Super 8GB.

EDIT: the sum construct (second version) is ~4 times slower than the simple Tullio expression. How to fix that actually?

julia> reg(x) = @tullio res = sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +abs2(x[i, j, k] - x[i, j+1, k]) + abs2(x[i, j, k] - x[i, j, k+1]))
reg (generic function with 1 method)

julia> reg2(x) = sum(@tullio res[i, j, k] := sqrt(abs2(x[i, j, k] - x[i+1, j, k]) +abs2(x[i, j, k] - x[i, j+1, k]) + abs2(x[i, j, k] - x[i, j, k+1])))
reg2 (generic function with 1 method)

julia> @time reg(x)
  0.010721 seconds (225 allocations: 10.500 KiB)
7.303821f7

julia> @time reg2(x)
  0.037554 seconds (196 allocations: 126.513 MiB)
7.3038136f7

mcabbott · March 9, 2021, 12:48am

Tullio’s GPU support is a bit experimental, and pretty naiive, for now it tries to write the simplest KernelAbstractions code. I think this works best on things close to broadcasting.

I think scalar reductions won’t work at all, right now, although they could have a friendlier error. They need another plan for “threading”, as there is no index it’s safe to divide the output along. If someone knows a good way to write (say) this example in KA, it might not be so hard to generalise, please make an issue.

This example might work better like this, allocating a smaller temporary res but I haven’t tried it:


reg3(x) = sum(@tullio res[j] := sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1])))

luraess · March 9, 2021, 10:02am

If you have simple functions you wish to be able to be executed both on CPU (including multi-threading) and GPUs, you could have a look at ParallelStencil.jl.

roflmaostc · March 9, 2021, 10:09am

That looks interesting!

What about automatic differentiation? Is it fast?

luraess · March 9, 2021, 10:23am

There is no automatic differentiation support right now in there, but you may be able to combine it in a workflow that would do it I suppose. If using CPU, ParallelStencil uses regular Julia arrays, and on GPU, it switches to CuArray.

roflmaostc · March 9, 2021, 10:29am

Strange, reg3 works for both CPU and GPU. For CPUs it is as fast as reg. For GPUs, it is even slower than reg2.

I’m currently trying a few combinations. Some results can be found in this jupyter notebook.

Topic		Replies	Views
Using Zygote with custom CUDA kernels Optimization (Mathematical) gpu , cuda , zygote , forwarddiff	3	1168	August 3, 2021
How to properly use Tullio + CUDA on a package extension? General Usage cuda	5	663	December 7, 2023
Zygote with Tullio gives wrong gradients/pullbacks using CUDA Machine Learning	1	224	February 26, 2024
Element-wise operations in Tullio.jl on GPU Numerics gpu , tullio	2	1128	November 28, 2021
Differentiable Regularizers for CUDA, CPUs? Specific Domains	0	40	September 26, 2024

Fast GPU Kernels differentiable with Zygote

Related topics