Map Performance with CuArrays

OK, that example clarifies a lot. We don’t efficiently support what you’re doing there: broadcasting a function that does its own broadcasting over the inner 1D dataset. If that inner broadcast is coarse enough, i.e. if it does enough work to saturate the GPU, a mapslices-like approach might work (assuming we optimize the implementation of that API). But generally it will be more efficient to create a function that you can broadcast over the entire 2D dataset (not an array of arrays) in one go, e.g. like the batched FFT interface in your first post. Some of your operations are element-wise anyway, and the final reduction you can perform in a batched manner with the dims keyword.
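A minimal sketch of that dataset-wide layout (array name and sizes are hypothetical; shown with CUDA.jl, the successor to CuArrays, whose CUFFT submodule provides the batched `fft`):

```julia
using CUDA, CUDA.CUFFT

# hypothetical dataset: 1024 signals of length 4096, one per column
A = CUDA.rand(ComplexF32, 4096, 1024)

# batched FFT along dimension 1: a single CUFFT call covers all columns
F = fft(A, 1)

# element-wise stages broadcast over the whole 2D array at once
P = abs2.(F)

# final per-signal reduction, batched via the dims keyword
s = sum(P; dims=1)   # 1×1024 result, one value per column
```

The point is that every stage operates on the full 2D array, so each launches one (large) GPU operation instead of one small operation per column.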

Don’t expect magic from the GPU. We implement broadcast to compile to a single kernel; you’re broadcasting a function that does an FFT and further broadcasts; that’s never going to work as a single kernel on the GPU. If you stick to this approach, optimizing mapslices would be the way to go. Else, you’d need to think about restructuring your operations to be dataset-wide, which would have advantages on the CPU too.
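To make the contrast concrete, here is a sketch of what does and doesn’t fuse (names are illustrative):

```julia
using CUDA

A = CUDA.rand(Float32, 4096, 1024)
B = CUDA.rand(Float32, 4096, 1024)

# plain element-wise broadcast: this whole expression compiles
# to a single fused GPU kernel
C = @. sin(A) * B + 1f0

# by contrast, a function that internally calls a library routine
# (e.g. an FFT) and then broadcasts again cannot fuse that way:
# process(x) = sum(abs2.(fft(x)))
# map(process, eachcol(A))  # many small kernels and library calls
```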

Except for the fact you’re writing high-level Julia code instead of CUDA C, you mean? :slightly_smiling_face:
