Zygote is not expected to figure this out. It would fail on the corresponding CPU code because it mutates arrays, but even if you avoided this, it will struggle with scalar loops.
There is a PR to make KernelAbstractions.jl derive a gradient kernel, using Enzyme: https://github.com/JuliaGPU/KernelAbstractions.jl/pull/255
Otherwise the rule is that you need to write a second kernel which computes the gradient, and hook them together with a rule (e.g. for ChainRules).
This is also what Tullio.jl will automate – for formulae it can understand, it will write both the forward and the gradient kernel.