Each iteration of the for loop mutates img, so later iterations read values that earlier iterations have already overwritten; there is a lot of intra-loop dependency here. Is this intended? Usually such operations write into a new image as the destination.
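For illustration, a minimal sketch of a non-mutating version, assuming the loop computes the double sum over k, l that appears in the expressions below (func and the array sizes are placeholders from the question):

out = similar(img)
for i in axes(img, 1), j in axes(img, 2)
    acc = zero(eltype(img))
    for k in axes(img, 1), l in axes(img, 2)
        acc += img[k, l] - func(img[k, l] - img[i, j])   # read only from img, write only to out
    end
    out[i, j] = acc
end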
Additionally, providing code is nice… providing running code is much nicer.
julia> using TensorCast # my package
julia> @pretty @reduce img[i, j] = sum(k, l) img[k, l] - func(img[k, l] - img[i, j])
begin
    @boundscheck ndims(img) == 2 || throw(ArgumentError("expected a 2-tensor img[i, j]"))
    local starling = transmute(img, Val((nothing, nothing, 1, 2)))
    sum!(img, @__dot__(starling - func(starling - img)))
end
Note that this allocates an intermediate array of size (size(img)..., size(img)...). With CUDA it should be possible to avoid this via lazy broadcasting, but that is not so easy to write out.
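To get a feeling for how big that intermediate is, a rough estimate assuming a hypothetical 256×256 Float32 image:

julia> (256^2)^2 * sizeof(Float32) / 1e9   # GB for the (256, 256, 256, 256) intermediate
17.179869184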
Another way is this, which does not use broadcasting:
using Tullio, KernelAbstractions, CUDA
@tullio img[i, j] = img[k, l] - func(img[k, l] - img[i, j]) # implicit sum over k,l
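If overwriting img in place is not what you want (per the first point above), := tells Tullio to allocate a fresh output array instead. A hedged sketch, with a placeholder func and input, assuming func is a scalar function that compiles for the GPU:

using Tullio, KernelAbstractions, CUDA

func(x) = x^2                          # placeholder, stands in for the real function
img_gpu = CUDA.rand(Float32, 64, 64)   # hypothetical input

# := allocates `out`; img_gpu is only read, never written
@tullio out[i, j] := img_gpu[k, l] - func(img_gpu[k, l] - img_gpu[i, j])  # implicit sum over k, l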