Using Zygote with custom CUDA kernels

Hi all. I’m wondering if it’s possible to use Zygote to automatically differentiate CUDA kernels that I’ve defined. A minimal (not-quite-working) example is below. My actual use case is not as trivial, but this is essentially the approach I’m taking.

using CUDA
using Zygote

function gpu_multiply!(y, x)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if index <= length(y)  # guard: the last block may have more threads than elements
        @inbounds y[index] *= x[index]
    end
    return
end

function gpu_multiply(x, y)
    z = copy(y)
    numblocks = ceil(Int, length(y)/256)
    CUDA.@sync begin
        @cuda threads=256 blocks=numblocks gpu_multiply!(z, x)
    end
    return z
end

function gpu_multiply_sum(x, y)
    return sum(gpu_multiply(x, y))
end

N = 4
x = CUDA.fill(1.0f0, N)
y = CUDA.fill(2.0f0, N)
z = gpu_multiply(x, y)
println(sum(z))

gradients = gradient((x, y) -> sum(x .* y), x, y)
println("Gradients = $gradients")
#  This gives: (Float32[2.0, 2.0, 2.0, 2.0], Float32[1.0, 1.0, 1.0, 1.0])

custom_kernel_gradients = gradient((x, y) -> sum(gpu_multiply(x, y)), x, y)
println("Custom kernel gradients = $gradients_cuda")
# This throws
# ERROR: LoadError: this intrinsic must be compiled to be called

Stepping through the last call shows that the error is thrown while attempting to call Zygote._pullback on the Core.Intrinsics.llvmcall intrinsic. I’m not sure what to make of this.

Does anyone else have experience getting Zygote to work with custom kernels? Is this a supported feature at all? Should I be trying a different approach with ForwardDiff or another library?

Any advice is welcome.

Zygote is not expected to figure this out. It would fail on the corresponding CPU code too, because it mutates arrays; and even if you avoided the mutation, it would struggle with the scalar indexing loop.
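For illustration, even the CPU analogue of the same loop hits Zygote’s mutation error before any GPU machinery is involved (a quick sketch; cpu_multiply is just a stand-in name):

using Zygote

function cpu_multiply(x, y)
    z = copy(y)
    for i in eachindex(z)
        z[i] *= x[i]  # in-place write: Zygote cannot trace through setindex!
    end
    return z
end

gradient((x, y) -> sum(cpu_multiply(x, y)), rand(Float32, 4), rand(Float32, 4))
# throws an error along the lines of: Mutating arrays is not supported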

There is a PR to make KernelAbstractions.jl derive a gradient kernel, using Enzyme: https://github.com/JuliaGPU/KernelAbstractions.jl/pull/255

Otherwise the rule is that you need to write a second kernel which computes the gradient, and hook the two together with a custom rule (e.g. an rrule for ChainRules, which Zygote will pick up).
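A minimal sketch of what that could look like for the example above, assuming ChainRulesCore; gpu_multiply_grad! is a gradient kernel written by hand here, not something any package generates for you:

using CUDA
using ChainRulesCore

# Gradient kernel: for z = x .* y, the input cotangents are
# dx = dz .* y and dy = dz .* x.
function gpu_multiply_grad!(dx, dy, dz, x, y)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if index <= length(dz)
        @inbounds dx[index] = dz[index] * y[index]
        @inbounds dy[index] = dz[index] * x[index]
    end
    return
end

function ChainRulesCore.rrule(::typeof(gpu_multiply), x, y)
    z = gpu_multiply(x, y)
    function gpu_multiply_pullback(dz)
        dz = CuArray(unthunk(dz))  # materialise the cotangent on the GPU; it may arrive lazily
        dx, dy = similar(x), similar(y)
        numblocks = ceil(Int, length(dz)/256)
        CUDA.@sync begin
            @cuda threads=256 blocks=numblocks gpu_multiply_grad!(dx, dy, dz, x, y)
        end
        return NoTangent(), dx, dy
    end
    return z, gpu_multiply_pullback
end

With this rule in place, gradient((x, y) -> sum(gpu_multiply(x, y)), x, y) should call the pullback instead of trying to trace into the kernel.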

This is also what Tullio.jl will automate – for formulae it can understand, it will write both the forward and the gradient kernel.


Tullio seems like a good solution for what I’m doing, since Einstein summation notation is flexible enough to express my use case. It also seems to be pretty fast and works on the GPU; some timings for the minimal example in the OP:

julia> using CUDA, KernelAbstractions, CUDAKernels, Tullio

julia> x, y = CUDA.fill(1.0f0, 4), CUDA.fill(2.0f0, 4);

julia> tullio_mult(a, b) = @tullio out[i] := a[i] * b[i]
tullio_mult (generic function with 1 method)

julia> CUDA.@time CUDA.@sync tullio_mult(x, y);
  0.000394 seconds (353 CPU allocations: 8.219 KiB) (1 GPU allocation: 16 bytes, 2.92% gc time of which 69.57% spent allocating)

julia> CUDA.@time CUDA.@sync gpu_multiply(x, y);
  0.000316 seconds (355 CPU allocations: 5.891 KiB) (1 GPU allocation: 16 bytes, 5.63% gc time of which 80.34% spent allocating)

It seems fast for larger examples too.

julia> x, y = CUDA.fill(1.0f0, 2^30), CUDA.fill(2.0f0, 2^30);

julia> CUDA.@time CUDA.@sync tullio_mult(x, y);
  0.195727 seconds (77.10 k CPU allocations: 1.179 MiB, 69.54% gc time) (1 GPU allocation: 4.000 GiB, 69.59% gc time of which 0.01% spent allocating)

julia> CUDA.@time CUDA.@sync gpu_multiply(x, y);
  0.185908 seconds (98.58 k CPU allocations: 1.505 MiB, 67.56% gc time) (1 GPU allocation: 4.000 GiB, 67.60% gc time of which 0.01% spent allocating)
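
And to close the loop on the original question: the gradient should just work through the Tullio version, since Tullio derives a gradient kernel itself (sketch, output omitted):

using Zygote

# No custom rule needed; Tullio hooks its derived gradient into Zygote.
grads = gradient((a, b) -> sum(tullio_mult(a, b)), x, y)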

Yes, Tullio should be competitive with a straightforward kernel. There are, I think, many tricks for making things like matrix multiplication fast which it doesn’t know about.