Is there a “parallel prefix sum” function implemented in CUDA.jl? If not, I’d offer to write one, but am relatively new to Julia and would need some examples/documentation on how to do that.
Take a look at `CUDA.scan!`, which is mentioned in the docs only in passing, and is implemented here. There are a few outstanding performance TODOs, and as it stands, its performance on my RTX 2060 GPU is roughly on par with single-threaded performance on my AMD Ryzen 2 CPU. The large number of allocations throws up a red flag, but I'm not sure where they're coming from.
```julia
julia> cA = CUDA.rand(2^20); cB = similar(cA);

julia> @btime CUDA.@sync last(CUDA.scan!(+, $cB, $cA, dims=1))
  705.100 μs (312 allocations: 8.34 KiB)
524474.7f0

julia> @btime CUDA.@sync sum($cA);
  231.000 μs (100 allocations: 2.28 KiB)

julia> @btime sum(A) setup=(A = Array(cA))
  241.500 μs (1 allocation: 16 bytes)
524474.75f0
```
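As a side note, `scan!` also backs the standard array API, so you don't have to call it directly: `accumulate`/`cumsum` on a `CuArray` should dispatch to it. A quick correctness check against the CPU might look like this (a sketch, assuming a working GPU; the dispatch claim is from my reading of CUDA.jl's source):

```julia
using CUDA

# Compare the GPU scan against Base.cumsum on the CPU.
cA = CUDA.rand(2^20)
cB = similar(cA)

CUDA.scan!(+, cB, cA, dims=1)   # inclusive prefix sum along dims=1

# cumsum on a CuArray should hit the same scan machinery
@assert Array(cB) ≈ cumsum(Array(cA))
@assert Array(cumsum(cA)) ≈ cumsum(Array(cA))
```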
`scan!` hasn't seen as much optimization as `mapreducedim!` has. That said, a 2060 isn't that powerful; comparing an RTX 5000 with a 5950X, I get 175/35/135 µs respectively.
`scan!` also fares a little better when there's more parallelism, e.g. reducing a 2^7×2^7×2^6 array takes 90 µs vs. 120 µs on the CPU. And 2^20 elements is only 4 MiB; ramping it up to 2^30 elements/4 GiB further shows that `scan!` needs some work, taking 150 ms vs. 135 ms on the CPU (but only 15 ms when using
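For anyone wanting to experiment with writing one (as the original question offered), a minimal single-block Hillis–Steele inclusive scan can be written with CUDA.jl's kernel programming interface. This is only a sketch of the naive algorithm, limited to one block of at most 1024 elements; it ignores the multi-block composition and work-efficiency concerns that make `scan!` hard to optimize:

```julia
using CUDA

# Naive Hillis–Steele inclusive scan for a single block (sketch only).
# At each step, every element adds in the value `offset` positions to
# its left; two sync_threads() calls separate the read and write phases
# so threads never race on the same slot.
function hillis_steele_scan!(data)
    i = threadIdx().x
    n = blockDim().x
    offset = 1
    while offset < n
        val = 0f0
        if i > offset
            val = data[i-offset]   # read phase
        end
        sync_threads()
        if i > offset
            data[i] += val         # write phase
        end
        sync_threads()
        offset <<= 1
    end
    return nothing
end

a = CUDA.rand(256)
@cuda threads=length(a) hillis_steele_scan!(a)
# Array(a) should now match cumsum of the original values
```

Extending this to arbitrary sizes means scanning per-block partial sums and adding them back, which is exactly where the interesting optimization work lives.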