Generic CPU and GPU loops with tiled iteration


#1

Can TiledIteration be used to write loops that work on both GPUs and CPUs, without having to write separate kernels for the latter or stuff everything into broadcast(s)?

@tim.holy


#2

Most likely, but I’m guessing it’s useful only if you want your GPU kernels to loop explicitly over blocks of pixels. If you take the typical one-output-pixel-per-thread approach, then it may not be relevant.

TiledIteration’s main advantage is cache reuse, “Halide-like.”
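
For illustration, here is a minimal sketch of what looping over tiles can look like on the CPU. It assumes the TiledIteration package is installed; `tiled_sum`, the array, and the tile size are hypothetical and chosen only for the example. `TileIterator` yields one tuple of index ranges per tile, so the inner work touches a small block that is more likely to stay in cache.

```julia
using TiledIteration

# Hypothetical helper: sum a matrix one tile at a time.
function tiled_sum(A::AbstractMatrix, tilesize::Dims{2})
    s = zero(eltype(A))
    # Each `tileaxs` is a tuple of index ranges, e.g. (1:4, 5:8).
    for tileaxs in TileIterator(axes(A), tilesize)
        # A view avoids copying the tile before reducing it.
        s += sum(view(A, tileaxs...))
    end
    return s
end

A = reshape(collect(1:64), 8, 8)   # 8×8 matrix, evenly divided by 4×4 tiles
total = tiled_sum(A, (4, 4))
```

A plain sum gains little from tiling; the pattern matters more when each tile is reused several times, as in a stencil or a blocked matrix multiply.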


#3

I don’t mean images necessarily, but any generic looping operation like a matrix multiply or a sum. Or do you mean “pixel” more generically?

Like here: Writing fast stencil computation kernels that work on both CPUs and GPUs


#4

Yes, images are just arrays, so I just mean processing the work associated with a particular output location. If your stencils are very local, then you most likely want to use the raw blockIdx etc. if you’re interested in maximal performance.
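
As a rough sketch of the one-output-location-per-thread style, assuming CUDA.jl and an NVIDIA GPU: each thread derives its own global index from `blockIdx`, `blockDim`, and `threadIdx` and writes a single output element. The kernel name `mean3_kernel!`, the three-point averaging stencil, and the launch sizes are hypothetical, picked only to keep the example short.

```julia
using CUDA

# Hypothetical kernel: three-point mean, one output element per thread.
function mean3_kernel!(out, inp)
    # Global 1-based index of this thread across all blocks.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Skip the boundaries and any threads past the end of the array.
    if 2 <= i <= length(inp) - 1
        @inbounds out[i] = (inp[i-1] + inp[i] + inp[i+1]) / 3f0
    end
    return nothing
end

# Only launch when a usable GPU is present.
if CUDA.functional()
    inp = CUDA.rand(Float32, 1024)
    out = copy(inp)                      # boundary elements keep their input values
    threads = 256
    blocks = cld(length(inp), threads)   # enough blocks to cover every element
    @cuda threads=threads blocks=blocks mean3_kernel!(out, inp)
end
```

Here there is no explicit loop over tiles at all: the launch grid does the partitioning, which is why a tile iterator adds little in this style.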