OK, that example clarifies a lot. We don't efficiently support what you're doing there: broadcasting a function that does its own broadcasting over the inner 1D dataset. If that inner broadcast is coarse enough, i.e. if it does enough work to saturate the GPU, a `mapslices`-like approach might work (assuming we optimize the implementation of that API). But generally it will be more efficient to create a function that you can broadcast over the entire 2D dataset (not an array of arrays) in one go, e.g. like the batched `fft` interface in your first post. Some of your operations are element-wise anyway, and the final reduce can be performed in a batched manner with the `dims` keyword; see the sketch below.
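Roughly what I mean, as a minimal sketch (the shapes and the `abs2`/`sum` pipeline are hypothetical stand-ins for your operations, and I'm assuming CUDA.jl's CUFFT integration so that `fft` on a `CuArray` dispatches to a batched cuFFT plan):

```julia
using CUDA
using CUDA.CUFFT        # CUFFT-backed `fft` methods for CuArrays

# Hypothetical layout: 256 signals of length 1024, one per column.
A = CUDA.rand(ComplexF32, 1024, 256)

F = fft(A, 1)           # one batched FFT along dim 1: a single cuFFT call covers all columns
P = abs2.(F)            # element-wise work broadcast over the whole 2D array, not per slice
s = sum(P; dims = 1)    # batched reduction via the `dims` keyword: one result per column
```

The point is that every step operates on the full 2D dataset at once, so each line maps to one (or a few) large GPU operations instead of one small operation per inner array.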
Don't expect magic from the GPU. We implement `broadcast` to compile to a single kernel; you're broadcasting a function that does an FFT and further broadcasts, and that's never going to work as a single kernel on the GPU. If you stick to this approach, optimizing `mapslices` would be the way to go. Otherwise, you'd need to think about restructuring your operations to be dataset-wide, which would have advantages on the CPU too.
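To illustrate the contrast (again with hypothetical shapes, and the slice-wise version shown only as the pattern to avoid):

```julia
using CUDA
using CUDA.CUFFT

A = CUDA.rand(ComplexF32, 1024, 256)

# Fused element-wise broadcast: the whole expression compiles to a single GPU kernel.
B = @. abs2(A) + 1f0

# The pattern under discussion: an outer map/broadcast whose body does its own FFT
# and inner broadcasts. That body can't run inside one kernel, so it degenerates
# into iterating slices from the host and launching many small operations.
# C = map(col -> sum(abs2.(fft(col))), eachcol(A))   # anti-pattern on the GPU
```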
Except for the fact that you're writing high-level Julia code instead of CUDA C, you mean?