Generic way to write array stencils/kernels for CPUs and GPUs?

Do we currently have an established way to write stencils for single-/multi-array operations, or kernels in general, in a way that’s compatible with both CPU and GPU arrays?

Regarding kernels/stencils in general, I found an old experiment by @timholy, KernelTools.jl, and there was some support for stencils in ParallelAccelerator.jl, but I’m not sure about the current situation.

I guess stencils could be seen as a generalized form of broadcasting (or, the other way around, broadcasting as applying a multi-array stencil of size one). Broadcasting-based code we can already write in a CPU/GPU-independent fashion, of course.
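To make the "size-one stencil" case concrete, here is a minimal sketch of a broadcasting-based kernel. The function name `fused_update!` and the operation are made up for illustration. It only runs on a plain `Array` here; the assumption is that a GPU array type that implements broadcasting (e.g. `CuArray` from CUDA.jl, not loaded in this sketch) can be passed in unchanged.

```julia
# Hypothetical elementwise "kernel": reads three arrays, writes one target array.
# Nothing here is specific to Array, so any AbstractArray type that implements
# broadcasting should be usable (assumption: GPU array types such as CuArray do).
function fused_update!(out, a, b, c)
    # The dotted calls fuse into one loop (or one kernel launch on a GPU array).
    out .= a .* b .+ sqrt.(abs.(c))
    return out
end

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [-9.0, 16.0, 0.0]
out = similar(a)
fused_update!(out, a, b, c)
```

Each output entry depends only on the entries at the same index, which is exactly why broadcasting cannot express neighbour access by itself.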

I have some vague ideas, but maybe someone is already working on this kind of thing?

has some stuff in this direction.

Oh, sure, but those are 2D stencils with a linear combination of fixed coefficients on a single array, right?

Maybe I chose the wrong term - when I wrote stencils, I meant kernels that have a fixed access pattern (but may read from multiple arrays), run a fixed but arbitrarily complex operation on the input values, and write the result to a single entry in a single target array. Maybe I should have just called it a kernel - but then, a kernel in the more generic meaning of the term (e.g. in CUDA) isn’t always restricted to a fixed access pattern or a single output array.
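A rough CPU-only sketch of what I mean, in case it helps. `stencil!` and its calling convention are hypothetical, not from any package: `offsets` is the fixed access pattern, `f` gets one tuple of neighbour values per input array, and the result goes to a single entry of `out`. Border entries are simply left untouched here.

```julia
# Hypothetical helper, CPU-only: scalar indexing like this would be slow or
# disallowed on GPU arrays, so this shows the semantics, not a portable solution.
function stencil!(f, out::AbstractArray, offsets, inputs::AbstractArray...)
    N = ndims(out)
    # Shrink the iteration range so that every offset stays in bounds.
    lo = ntuple(d -> first(axes(out, d)) - min(0, minimum(o[d] for o in offsets)), N)
    hi = ntuple(d -> last(axes(out, d)) - max(0, maximum(o[d] for o in offsets)), N)
    for I in CartesianIndices(ntuple(d -> lo[d]:hi[d], N))
        # For each input array, gather the values at the fixed offsets.
        vals = map(A -> map(o -> A[I + o], offsets), inputs)
        out[I] = f(vals...)
    end
    return out
end

offsets = (CartesianIndex(-1), CartesianIndex(0), CartesianIndex(1))
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.0, 1.0, 2.0, 2.0, 3.0]
result = zeros(5)
# Arbitrary operation: sum the three neighbours of `a`, scale by the centre of `b`.
stencil!((av, bv) -> (av[1] + av[2] + av[3]) * bv[2], result, offsets, a, b)
```

The same function works for higher dimensions if the offsets are N-dimensional `CartesianIndex` values; the open question is how to get this pattern onto a GPU without writing the kernel by hand.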

Yeah, it’s single-array, but not restricted to 2D, and (on CPU) it allows arbitrary operations via mapwindow.

Oh, neat, good to know (does mapwindow use views?)

No, it copies to a buffer, so mapwindow(f!, arr) will not modify arr.
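To illustrate the copy-to-buffer idea (this is a made-up 1D sketch in plain Base Julia, not the library’s actual implementation; `mapwindow_sketch`, the `halfwidth` argument, and the edge-replicating border are all assumptions for the example):

```julia
# Hypothetical 1D sketch: each window is copied into `buf` before `f` sees it,
# so a mutating `f` (like the sort! below) only ever changes the buffer.
function mapwindow_sketch(f, arr::AbstractVector, halfwidth::Integer)
    buf = Vector{eltype(arr)}(undef, 2 * halfwidth + 1)
    out = similar(arr)
    for i in eachindex(arr)
        for (k, j) in enumerate(i-halfwidth:i+halfwidth)
            # Replicate the edge value for out-of-range neighbours (assumption).
            buf[k] = arr[clamp(j, firstindex(arr), lastindex(arr))]
        end
        out[i] = f(buf)
    end
    return out
end

arr = [3, 1, 2, 5, 4]
# Mutating window function: median of three via an in-place sort.
median3!(w) = (sort!(w); w[2])
filtered = mapwindow_sketch(median3!, arr, 1)
```

After the call, `arr` still holds its original values, which is the behaviour described above; with views instead of a buffer, the `sort!` would have scrambled the input.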