Is there a “parallel prefix sum” function implemented in CUDA.jl? If not, I’d offer to write one, but am relatively new to Julia and would need some examples/documentation on how to do that.
Take a look at `CUDA.scan!`, which is mentioned in the docs only in passing, and is implemented here. There are a few outstanding performance TODOs, and as it stands, its performance on my RTX 2060 GPU is roughly on par with single-threaded performance on my AMD Ryzen 2 CPU. The large number of allocations throws up a red flag, but I'm not sure where they're coming from.
```julia
julia> cA = CUDA.rand(2^20); cB = similar(cA);

julia> @btime CUDA.@sync last(CUDA.scan!(+, $cB, $cA, dims=1))
  705.100 μs (312 allocations: 8.34 KiB)
524474.7f0

julia> @btime CUDA.@sync sum($cA);
  231.000 μs (100 allocations: 2.28 KiB)

julia> @btime sum(A) setup=(A = Array(cA))
  241.500 μs (1 allocation: 16 bytes)
524474.75f0
```
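As a side note, `scan!` also backs the standard array API, so you don't have to call it directly: `accumulate`/`cumsum` on a `CuArray` should dispatch to it. A quick correctness check against the CPU might look like this (a sketch, assuming a working GPU; the dispatch claim is from my reading of CUDA.jl's source):

```julia
using CUDA

# Compare the GPU scan against Base.cumsum on the CPU.
cA = CUDA.rand(2^20)
cB = similar(cA)

CUDA.scan!(+, cB, cA, dims=1)   # inclusive prefix sum along dims=1

# cumsum on a CuArray should hit the same scan machinery
@assert Array(cB) ≈ cumsum(Array(cA))
@assert Array(cumsum(cA)) ≈ cumsum(Array(cA))
```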
`scan!` hasn't seen as much optimization as `mapreducedim!` has. That said, a 2060 isn't that powerful; comparing an RTX 5000 with a 5950X, I get 175/35/135 µs respectively.
`scan!` also fares a little better when there's more parallelism, e.g. reducing a 2^7×2^7×2^6 array takes 90 µs vs. 120 µs on the CPU. And 2^20 elements is only 4 MiB; ramping it up to 2^30 elements/4 GiB further shows that `scan!` needs some work, taking 150 ms vs. 135 ms on the CPU (but only 15 ms when using
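For anyone wanting to experiment with writing one (as the original question offered), a minimal single-block Hillis–Steele inclusive scan can be written with CUDA.jl's kernel programming interface. This is only a sketch of the naive algorithm, limited to one block of at most 1024 elements; it ignores the multi-block composition and work-efficiency concerns that make `scan!` hard to optimize:

```julia
using CUDA

# Naive Hillis–Steele inclusive scan for a single block (sketch only).
# At each step, every element adds in the value `offset` positions to
# its left; two sync_threads() calls separate the read and write phases
# so threads never race on the same slot.
function hillis_steele_scan!(data)
    i = threadIdx().x
    n = blockDim().x
    offset = 1
    while offset < n
        val = 0f0
        if i > offset
            val = data[i-offset]   # read phase
        end
        sync_threads()
        if i > offset
            data[i] += val         # write phase
        end
        sync_threads()
        offset <<= 1
    end
    return nothing
end

a = CUDA.rand(256)
@cuda threads=length(a) hillis_steele_scan!(a)
# Array(a) should now match cumsum of the original values
```

Extending this to arbitrary sizes means scanning per-block partial sums and adding them back, which is exactly where the interesting optimization work lives.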