Can TiledIteration be used to write loops that work on both GPUs and CPUs, without having to write separate kernels for the latter or stuff everything into broadcast(s)? @tim.holy

Most likely, but I’m guessing it’s useful only if you want your GPU kernels to loop explicitly over blocks of pixels. If you do the typical one-output-pixel-per-thread, then it may not be relevant. TiledIteration’s main advantage is in reusing cache, “Halide-like.”
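To make the cache-reuse point concrete, here is a minimal sketch of a tiled loop in plain Julia. It does not use the TiledIteration package itself: `tiles` and `tiled_transpose!` are hypothetical helpers written for this illustration, and the tile size of 64 is an arbitrary assumption rather than a tuned value.

```julia
# Hypothetical helper: split an index range into contiguous chunks of length `sz`
# (the last chunk may be shorter).
tiles(r::AbstractUnitRange, sz::Integer) =
    [i:min(i + sz - 1, last(r)) for i in first(r):sz:last(r)]

# Tiled out-of-place transpose. Working tile by tile keeps the reads from `src`
# and the writes to `dst` confined to a small block that is likely to stay in cache.
function tiled_transpose!(dst::AbstractMatrix, src::AbstractMatrix; tilesize::Integer = 64)
    axes(dst) == reverse(axes(src)) ||
        throw(DimensionMismatch("dst must have the transposed shape of src"))
    # Outer loops walk over tiles, inner loops walk over the elements of one tile.
    for jt in tiles(axes(src, 2), tilesize), it in tiles(axes(src, 1), tilesize)
        for j in jt, i in it
            @inbounds dst[j, i] = src[i, j]
        end
    end
    return dst
end

A = rand(100, 70)
B = similar(A, 70, 100)
tiled_transpose!(B, A; tilesize = 32)
@assert B == permutedims(A)
```

The same outer-loop-over-tiles structure is what a GPU kernel would need if each block were to process a whole tile rather than a single output element.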


Yes, images are just arrays, so I just mean processing the work associated with a particular output location. If your stencils are very local, then you most likely want to use the raw blockIdx etc. if you’re interested in maximal performance.
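For comparison, the "one output location per thread" layout can be emulated on the CPU as a rough sketch. The integers `blockidx`, `blockdim` and `threadidx` below are ordinary loop variables standing in for the GPU intrinsics, and `global_index` and `emulate_kernel!` are hypothetical names used only for this illustration. The 1-based index formula is the usual convention in Julia GPU code.

```julia
# Map a (block, thread) pair to a 1-based linear index into the output.
global_index(blockidx, blockdim, threadidx) = (blockidx - 1) * blockdim + threadidx

# CPU emulation: each (blockidx, threadidx) pair handles exactly one output element.
function emulate_kernel!(out::AbstractVector, a::AbstractVector, b::AbstractVector;
                         blockdim::Integer = 4)
    n = length(out)
    nblocks = cld(n, blockdim)      # enough blocks to cover every element
    for blockidx in 1:nblocks, threadidx in 1:blockdim
        i = global_index(blockidx, blockdim, threadidx)
        i <= n || continue          # guard: the last block may overhang the array
        out[i] = a[i] + b[i]
    end
    return out
end

a = collect(1.0:10.0)
b = fill(2.0, 10)
out = similar(a)
emulate_kernel!(out, a, b)
@assert out == a .+ b
```

Here no thread loops over a block of elements, so there is nothing for a tile iterator to do. That is the case where tiling may not be relevant.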

Generic CPU and GPU loops with tiled iteration


datnamer January 31, 2019, 12:31am 3

I don’t mean images necessarily, but any generic looping operation like matrix multiply or sum. Or do you mean a pixel more generically?

Like here: Writing fast stencil computation kernels that work on both CPUs and GPUs - #3 by maleadt
