How to write a kernel to evaluate a function on an arbitrary dimensional grid?

Hi, I’m trying to write a kernel (using KernelAbstractions.jl) to evaluate a function f over an arbitrary dimensional Cartesian grid, which is represented by vectors rs.... To give a concrete example, for a 2D grid I can write the following:

using KernelAbstractions

@kernel function map_grid_kernel!(dest, f, x, y)
    J = @index(Global, NTuple)
    dest[J...] = f(x[J[1]], y[J[2]])
end

function map_grid!(dest, f, x, y)
    backend = get_backend(dest)
    f! = map_grid_kernel!(backend)
    f!(dest, f, x, y; ndrange=size(dest))
end

Then the following works just fine:

x = LinRange(-3, 3, 128)
y = copy(x)
dest = Array{Float64}(undef, length(x), length(y))

f(x...) = exp(-sum(abs2, x))

map_grid!(dest, f, x, y)

I’m having problems generalizing this to accept grids of arbitrary dimension. Although I’m able to write

@kernel function map_grid_arbitrary!(dest, f, rs...)
    J = @index(Global, NTuple)
    dest[J...] = f(ntuple(n -> rs[n][J[n]], ndims(dest))...)
end

function map_grid_arbitrary!(dest, f, rs...)
    backend = get_backend(dest)
    f! = map_grid_arbitrary!(backend)
    f!(dest, f, rs...; ndrange=size(dest))
end

this version allocates more then the previous one when called with the same arguments, and is also much slower. I couldn’t find a solution that matches the speed of the explicit 2D case. Does anyone has an idea of what can be done here? Thanks in advance!