Using Zygote with custom CUDA kernels

Hi all. I’m wondering if it’s possible to use Zygote to automatically differentiate CUDA kernels that I’ve defined. A minimal (not-quite-working) example is below. My actual use case is not as trivial, but this is essentially the approach I’m taking.

using CUDA
using Zygote

function gpu_multiply!(y, x)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if index <= length(y)  # guard: the last block may have more threads than elements
        @inbounds y[index] *= x[index]
    end
    return
end

function gpu_multiply(x, y)
    z = copy(y)
    numblocks = ceil(Int, length(y)/256)
    CUDA.@sync begin
        @cuda threads=256 blocks=numblocks gpu_multiply!(z, x)
    end
    return z
end

function gpu_multiply_sum(x, y)
    return sum(gpu_multiply(x, y))
end

N = 4
x = CUDA.fill(1.0f0, N)
y = CUDA.fill(2.0f0, N)
z = gpu_multiply(x, y)
println(sum(z))

gradients = gradient((x, y) -> sum(x .* y), x, y)
println("Gradients = $gradients")
#  This gives: (Float32[2.0, 2.0, 2.0, 2.0], Float32[1.0, 1.0, 1.0, 1.0])

custom_kernel_gradients = gradient((x, y) -> sum(gpu_multiply(x, y)), x, y)
println("Custom kernel gradients = $gradients_cuda")
# This throws
# ERROR: LoadError: this intrinsic must be compiled to be called

Stepping through the last call shows that the error is thrown while attempting to call Zygote._pullback on the Core.Intrinsics.llvmcall intrinsic. I’m not sure what to make of this.

Does anyone else have experience getting Zygote to work with custom kernels? Is this a supported feature at all? Should I be trying a different approach with ForwardDiff or another library?

Any advice is welcome.

Zygote is not expected to figure this out. It would fail on the corresponding CPU code too, because it mutates arrays; and even if you avoided the mutation, it would struggle with the scalar indexing loop.
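For illustration, even the CPU analogue of the same loop hits Zygote’s mutation error before any GPU machinery is involved (a quick sketch; cpu_multiply is just a stand-in name):

using Zygote

function cpu_multiply(x, y)
    z = copy(y)
    for i in eachindex(z)
        z[i] *= x[i]  # in-place write: Zygote cannot trace through setindex!
    end
    return z
end

gradient((x, y) -> sum(cpu_multiply(x, y)), rand(Float32, 4), rand(Float32, 4))
# throws an error along the lines of: Mutating arrays is not supported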

There is a PR to make KernelAbstractions.jl derive a gradient kernel, using Enzyme: https://github.com/JuliaGPU/KernelAbstractions.jl/pull/255

Otherwise the rule is that you need to write a second kernel which computes the gradient, and hook the two together with a custom rule (e.g. an rrule for ChainRules, which Zygote will pick up).
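A minimal sketch of what that could look like for the example above, assuming ChainRulesCore; gpu_multiply_grad! is a gradient kernel written by hand here, not something any package generates for you:

using CUDA
using ChainRulesCore

# Gradient kernel: for z = x .* y, the input cotangents are
# dx = dz .* y and dy = dz .* x.
function gpu_multiply_grad!(dx, dy, dz, x, y)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if index <= length(dz)
        @inbounds dx[index] = dz[index] * y[index]
        @inbounds dy[index] = dz[index] * x[index]
    end
    return
end

function ChainRulesCore.rrule(::typeof(gpu_multiply), x, y)
    z = gpu_multiply(x, y)
    function gpu_multiply_pullback(dz)
        dz = CuArray(unthunk(dz))  # materialise the cotangent on the GPU; it may arrive lazily
        dx, dy = similar(x), similar(y)
        numblocks = ceil(Int, length(dz)/256)
        CUDA.@sync begin
            @cuda threads=256 blocks=numblocks gpu_multiply_grad!(dx, dy, dz, x, y)
        end
        return NoTangent(), dx, dy
    end
    return z, gpu_multiply_pullback
end

With this rule in place, gradient((x, y) -> sum(gpu_multiply(x, y)), x, y) should call the pullback instead of trying to trace into the kernel.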

This is also what Tullio.jl will automate – for formulae it can understand, it will write both the forward and the gradient kernel.


Tullio seems like a good solution for what I’m doing, since Einstein summation notation is flexible enough to express my use case. It also seems to be pretty fast and works on the GPU; some timings for the minimal example in the OP:

julia> using CUDA, KernelAbstractions, CUDAKernels, Tullio

julia> x, y = CUDA.fill(1.0f0, 4), CUDA.fill(2.0f0, 4);

julia> tullio_mult(a, b) = @tullio out[i] := a[i] * b[i]
tullio_mult (generic function with 1 method)

julia> CUDA.@time CUDA.@sync tullio_mult(x, y);
  0.000394 seconds (353 CPU allocations: 8.219 KiB) (1 GPU allocation: 16 bytes, 2.92% gc time of which 69.57% spent allocating)

julia> CUDA.@time CUDA.@sync gpu_multiply(x, y);
  0.000316 seconds (355 CPU allocations: 5.891 KiB) (1 GPU allocation: 16 bytes, 5.63% gc time of which 80.34% spent allocating)

It seems fast for larger examples too.

julia> x, y = CUDA.fill(1.0f0, 2^30), CUDA.fill(2.0f0, 2^30);

julia> CUDA.@time CUDA.@sync tullio_mult(x, y);
  0.195727 seconds (77.10 k CPU allocations: 1.179 MiB, 69.54% gc time) (1 GPU allocation: 4.000 GiB, 69.59% gc time of which 0.01% spent allocating)

julia> CUDA.@time CUDA.@sync gpu_multiply(x, y);
  0.185908 seconds (98.58 k CPU allocations: 1.505 MiB, 67.56% gc time) (1 GPU allocation: 4.000 GiB, 67.60% gc time of which 0.01% spent allocating)
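
And to close the loop on the original question: the gradient should just work through the Tullio version, since Tullio derives a gradient kernel itself (sketch, output omitted):

using Zygote

# No custom rule needed; Tullio hooks its derived gradient into Zygote.
grads = gradient((a, b) -> sum(tullio_mult(a, b)), x, y)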

Yes, Tullio should be competitive with a straightforward kernel. There are, I think, many tricks for making things like matrix multiplication fast which it doesn’t know about.