Using Zygote with custom CUDA kernels

Tullio seems like a good solution for what I’m doing, since Einstein summation is flexible enough to express my use case. It also seems to be quite fast and works on the GPU; here are some timings for the minimal example in the OP (a Zygote gradient sketch follows the timings):

julia> using CUDA, KernelAbstractions, CUDAKernels, Tullio

julia> x, y = CUDA.fill(1.0f0, 4), CUDA.fill(2.0f0, 4);

julia> tullio_mult(a, b) = @tullio out[i] := a[i] * b[i]
tullio_mult (generic function with 1 method)

julia> CUDA.@time CUDA.@sync tullio_mult(x, y);
  0.000394 seconds (353 CPU allocations: 8.219 KiB) (1 GPU allocation: 16 bytes, 2.92% gc time of which 69.57% spent allocating)

julia> CUDA.@time CUDA.@sync gpu_multiply(x, y);
  0.000316 seconds (355 CPU allocations: 5.891 KiB) (1 GPU allocation: 16 bytes, 5.63% gc time of which 80.34% spent allocating)
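
Since the thread is about getting Zygote to differentiate through a custom GPU kernel, the useful part is that @tullio also writes a gradient definition for the expression, so the same function should be differentiable with Zygote directly on CuArrays. A minimal sketch, continuing the session above (I haven't benchmarked the gradient pass):

using Zygote

# Tullio supplies the gradient rule for tullio_mult, so Zygote can differentiate
# a scalar loss built from it on the GPU arrays.
loss(a, b) = sum(tullio_mult(a, b))
da, db = Zygote.gradient(loss, x, y)   # expected to come back as CuArrays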

It seems fast for larger examples too:

julia> x, y = CUDA.fill(1.0f0, 2^30), CUDA.fill(2.0f0, 2^30);

julia> CUDA.@time CUDA.@sync tullio_mult(x, y);
  0.195727 seconds (77.10 k CPU allocations: 1.179 MiB, 69.54% gc time) (1 GPU allocation: 4.000 GiB, 69.59% gc time of which 0.01% spent allocating)

julia> CUDA.@time CUDA.@sync gpu_multiply(x, y);
  0.185908 seconds (98.58 k CPU allocations: 1.505 MiB, 67.56% gc time) (1 GPU allocation: 4.000 GiB, 67.60% gc time of which 0.01% spent allocating)
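
For reference, gpu_multiply above is the hand-written kernel from the OP, which isn't reproduced in this post. A rough sketch of what such an elementwise-multiply kernel typically looks like with CUDA.jl is below; the kernel body and launch configuration are my own assumptions and may differ from the OP's actual code:

using CUDA  # already loaded in the session above

# Hypothetical stand-in for the OP's gpu_multiply: a hand-written elementwise multiply.
function mul_kernel!(out, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = a[i] * b[i]
    end
    return nothing
end

function gpu_multiply(a, b)
    out = similar(a)
    threads = 256
    blocks = cld(length(out), threads)
    @cuda threads=threads blocks=blocks mul_kernel!(out, a, b)
    return out
end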