Can TiledIteration be used to write loops that work on both GPUs and CPUs, without having to write separate kernels for the latter or stuff everything into broadcast(s)?
Most likely, but I’m guessing it’s useful only if you want your GPU kernels to loop explicitly over blocks of pixels. If you do the typical one-output-pixel-per-thread, then it may not be relevant.
TiledIteration’s main advantage is in reusing cache, “Halide-like.”
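For what it’s worth, here is a minimal CPU-side sketch of the kind of blocked loop I have in mind. It assumes TiledIteration.jl is installed and uses its `TileIterator`. The function name `tiled_sum` and the tile size of `(64, 64)` are made up for illustration, not tuned values.

```julia
using TiledIteration  # assumes the TiledIteration.jl package is installed

# Hypothetical helper: sum a matrix one tile at a time.
# Each element produced by TileIterator is a tuple of index ranges,
# one range per dimension, describing a single block of the array.
function tiled_sum(A::AbstractMatrix, tilesize=(64, 64))
    total = zero(eltype(A))
    for tileaxs in TileIterator(axes(A), tilesize)
        # Only this block is touched in the inner work, so it can stay in cache.
        total += sum(view(A, tileaxs...))
    end
    return total
end

A = reshape(collect(1.0:10_000.0), 100, 100)
tiled_sum(A) ≈ sum(A)
```

A plain sum gains nothing from tiling, of course; the point is only the loop shape. The payoff comes when each tile is reused several times, as in a stencil or a matrix multiply.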
I don’t mean images necessarily, but any generic looping operation like matrix multiply or sum. Or do you mean “pixel” more generically?
Like here: Writing fast stencil computation kernels that work on both CPUs and GPUs - #3 by maleadt
Yes, images are just arrays, so by “pixel” I just mean the work associated with a particular output location. If your stencils are very local, then you most likely want to use the raw blockIdx etc. if you’re interested in maximal performance.