Generic CPU and GPU loops with tiled iteration

I don’t necessarily mean images, but any generic looping operation, such as a matrix multiply or a sum. Or do you mean “pixel” in a more general sense?

Like here: Writing fast stencil computation kernels that work on both CPUs and GPUs - #3 by maleadt
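
To make concrete what I mean by a generic tiled loop, here is a rough sketch in plain Python. It is not taken from the linked thread, and the names `tiles`, `tiled_sum`, and `tiled_matmul` are made up for illustration. The idea is only that the same tile iterator can drive a reduction and a matrix multiply, whatever backend ends up running the inner loops.

```python
def tiles(n, tile):
    """Yield (start, stop) index ranges covering 0..n in chunks of at most `tile`."""
    for start in range(0, n, tile):
        yield start, min(start + tile, n)


def tiled_sum(xs, tile=4):
    """Sum a sequence one tile at a time, then combine the per-tile partial sums."""
    total = 0
    for lo, hi in tiles(len(xs), tile):
        partial = 0
        for i in range(lo, hi):
            partial += xs[i]
        total += partial
    return total


def tiled_matmul(a, b, tile=2):
    """Multiply two matrices (lists of rows), iterating over tiles of i, j, and the inner index p."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0, i1 in tiles(n, tile):
        for j0, j1 in tiles(m, tile):
            for p0, p1 in tiles(k, tile):
                # Work on one (i, j, p) tile; on a GPU this block would
                # roughly correspond to what a single workgroup handles.
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        acc = c[i][j]
                        for p in range(p0, p1):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

For example, `tiled_sum(list(range(10)))` walks the tiles `(0, 4)`, `(4, 8)`, `(8, 10)`, and `tiled_matmul` handles sizes that are not a multiple of the tile width because the last tile is clipped by `min`.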