Map Performance with CuArrays

But you were mapping over a CPU collection of GPU arrays, so isn’t it expected that this results in multiple separate calls to the GPU? If you map over a single GPU array instead, you can use arbitrary functions and are not constrained to the simple atomic operations you mention. Furthermore, with broadcast fusion (the “dot” syntax) you can trivially compose new and complex kernels that execute on the GPU in a single step. If you need more flexibility than that, you have to resort to writing your own kernels (which can be arbitrarily complex), or use something like GPUifyLoops for the vendor-agnostic equivalent.
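To make the distinction concrete, here is a rough sketch of the three cases. It uses plain `Array`s so it runs without a GPU; I am assuming that wrapping the inputs in `CuArray(...)` sends the same `map` and broadcast expressions to the GPU, which is the whole point of the array abstraction. The function `f` and the names `chunks`, `x`, `y` and `out` are made up for illustration.

```julia
# Plain Arrays keep this runnable on any machine. With CuArrays loaded,
# replacing the inputs with CuArray(...) is assumed to run the same
# expressions on the GPU.

# Hypothetical element-wise function; any scalar function would do.
f(x, y) = sin(x)^2 + 2f0 * y

x = rand(Float32, 1024)
y = rand(Float32, 1024)

# Case 1: a CPU collection of arrays. `map` loops over the outer Vector
# on the host and issues one separate array operation per element.
chunks = [rand(Float32, 256) for _ in 1:4]
sums = map(sum, chunks)

# Case 2: `map` over the arrays themselves with an arbitrary function.
# The whole element-wise computation is a single operation.
z_map = map(f, x, y)

# Case 3: broadcast fusion. Every dotted call in the expression is fused
# into one loop, with no intermediate arrays.
z_dot = @. sin(x)^2 + 2f0 * y

# Fusing into a preallocated output avoids allocating the result too.
out = similar(x)
out .= f.(x, y)
```

Case 1 is what the original code was doing; cases 2 and 3 are what I meant by mapping over the GPU data directly. A hand-written kernel only becomes necessary once the computation is no longer element-wise.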