Generic CPU and GPU loops with tiled iteration

datnamer · January 30, 2019, 10:11pm

Can TiledIteration be used to write loops that work on both GPUs and CPUs, without having to write separate kernels for the latter or stuff everything into broadcast(s)?

@tim.holy

tim.holy · January 30, 2019, 11:56pm

Most likely, but I’m guessing it’s useful only if you want your GPU kernels to loop explicitly over blocks of pixels. If you use the typical one-output-pixel-per-thread approach, it may not be relevant.

TiledIteration’s main advantage is cache reuse, “Halide-like.”
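
As a rough illustration of what looping over tiles means, here is a minimal sketch in plain Python rather than Julia. The names `tile_ranges` and `tiled_transpose` are hypothetical and are not TiledIteration's API; the sketch shows only the loop structure (an outer loop over tiles, inner loops within one tile), and pure-Python lists will not show the cache benefit that a compiled implementation can get.

```python
from itertools import product


def tile_ranges(shape, tile_shape):
    """Yield one tuple of ranges per tile, covering an array of the given shape.

    Hypothetical helper for illustration; edge tiles are clipped to the array.
    """
    per_axis = [
        [range(start, min(start + step, n)) for start in range(0, n, step)]
        for n, step in zip(shape, tile_shape)
    ]
    yield from product(*per_axis)


def tiled_transpose(a, tile_shape=(2, 2)):
    """Transpose a non-empty list-of-lists matrix, visiting it tile by tile."""
    rows, cols = len(a), len(a[0])
    out = [[None] * rows for _ in range(cols)]
    # Outer loop: one iteration per tile. Inner loops: stay inside that tile,
    # so reads of `a` and writes to `out` both touch a small block at a time.
    for row_range, col_range in tile_ranges((rows, cols), tile_shape):
        for i in row_range:
            for j in col_range:
                out[j][i] = a[i][j]
    return out
```

The same outer loop could in principle hand each tile to a CPU thread or a GPU block, which is the sense in which a tile iterator might serve both targets.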

datnamer · January 31, 2019, 12:31am

I don’t mean images necessarily, but any generic looping operation like matrix multiply or sum. Or do you mean “pixel” more generically?

Like here: Writing fast stencil computation kernels that work on both CPUs and GPUs - #3 by maleadt

tim.holy · January 31, 2019, 3:55am

Yes, images are just arrays, so by “pixel” I just mean the work associated with a particular output location. If your stencils are very local, then you most likely want to use the raw blockIdx etc. if you’re interested in maximal performance.
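
For contrast, the one-output-location-per-thread pattern can be sketched as follows. This is a serial Python emulation under stated assumptions, not real GPU code: `launch` and `stencil_kernel` are hypothetical names, indices are 0-based as in CUDA C, and the 3-point sum is just a stand-in for a very local stencil.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Serially emulate a 1-D GPU launch: call the kernel once per (block, thread)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def stencil_kernel(block_idx, block_dim, thread_idx, out, inp):
    """Each emulated thread computes exactly one output element."""
    # Global output index derived from the raw block and thread indices.
    i = block_idx * block_dim + thread_idx
    # Guard: skip the boundaries and any surplus threads past the array end.
    if 1 <= i < len(inp) - 1:
        out[i] = inp[i - 1] + inp[i] + inp[i + 1]


inp = [1, 2, 3, 4, 5]
out = [0] * len(inp)
launch(stencil_kernel, 2, 4, out, inp)  # 2 blocks of 4 threads cover 5 elements
```

Here there is no explicit loop over a block of outputs inside the kernel, which is why a tile iterator has little to add in this style.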
