Is there a “parallel prefix sum” function implemented in CUDA.jl? If not, I’d offer to write one, but am relatively new to Julia and would need some examples/documentation on how to do that.
Take a look at `CUDA.scan!`, which is mentioned in the docs only in passing, and is implemented here. There are a few outstanding performance TODOs, and as it stands, its performance on my RTX 2060 GPU is roughly on par with single-threaded performance on my AMD Ryzen 2 CPU. The large number of allocations throws up a red flag, but I'm not sure where they're coming from.
```julia
julia> cA = CUDA.rand(2^20); cB = similar(cA);

julia> @btime CUDA.@sync last(CUDA.scan!(+, $cB, $cA, dims=1))
  705.100 μs (312 allocations: 8.34 KiB)
524474.7f0

julia> @btime CUDA.@sync sum($cA);
  231.000 μs (100 allocations: 2.28 KiB)

julia> @btime sum(A) setup=(A = Array(cA))
  241.500 μs (1 allocation: 16 bytes)
524474.75f0
```
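As a side note, `scan!` also backs the standard array API, so you don't have to call it directly: `accumulate`/`cumsum` on a `CuArray` should dispatch to it. A quick correctness check against the CPU might look like this (a sketch, assuming a working GPU; the dispatch claim is from my reading of CUDA.jl's source):

```julia
using CUDA

# Compare the GPU scan against Base.cumsum on the CPU.
cA = CUDA.rand(2^20)
cB = similar(cA)

CUDA.scan!(+, cB, cA, dims=1)   # inclusive prefix sum along dims=1

# cumsum on a CuArray should hit the same scan machinery
@assert Array(cB) ≈ cumsum(Array(cA))
@assert Array(cumsum(cA)) ≈ cumsum(Array(cA))
```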
`scan!` hasn't seen as much optimization as `mapreducedim!` has. That said, a 2060 isn't that powerful; comparing an RTX 5000 with a 5950X, I get 175/35/135 µs respectively.
`scan!` also fares a little better when there's more parallelism, e.g. reducing a 2^7×2^7×2^6 array takes 90 µs vs. 120 µs on the CPU. And 2^20 elements is only 4 MiB; ramping it up to 2^30 elements/4 GiB further shows that `scan!` needs some work, taking 150 ms vs. 135 ms on the CPU (but only 15 ms when using
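For anyone wanting to experiment with writing one (as the original question offered), a minimal single-block Hillis–Steele inclusive scan can be written with CUDA.jl's kernel programming interface. This is only a sketch of the naive algorithm, limited to one block of at most 1024 elements; it ignores the multi-block composition and work-efficiency concerns that make `scan!` hard to optimize:

```julia
using CUDA

# Naive Hillis–Steele inclusive scan for a single block (sketch only).
# At each step, every element adds in the value `offset` positions to
# its left; two sync_threads() calls separate the read and write phases
# so threads never race on the same slot.
function hillis_steele_scan!(data)
    i = threadIdx().x
    n = blockDim().x
    offset = 1
    while offset < n
        val = 0f0
        if i > offset
            val = data[i-offset]   # read phase
        end
        sync_threads()
        if i > offset
            data[i] += val         # write phase
        end
        sync_threads()
        offset <<= 1
    end
    return nothing
end

a = CUDA.rand(256)
@cuda threads=length(a) hillis_steele_scan!(a)
# Array(a) should now match cumsum of the original values
```

Extending this to arbitrary sizes means scanning per-block partial sums and adding them back, which is exactly where the interesting optimization work lives.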