Hello,
I need to implement the cumulative sum (`cumsum`) on a GPU array (CUDA.jl or Metal.jl). Looking at the CUDA.jl repository, I found this definition:
```julia
function cumsum!(sums)
    # Hillis–Steele-style inclusive scan: on each round, every thread adds in
    # the element `shift` positions to its left, and `shift` doubles.
    shift = 1
    while shift < length(sums)
        to_add = 0
        @inbounds if threadIdx().x - shift > 0
            to_add = sums[threadIdx().x - shift]
        end
        sync_threads()  # all reads must finish before any thread writes
        @inbounds if threadIdx().x - shift > 0
            sums[threadIdx().x] += to_add
        end
        sync_threads()
        shift *= 2
    end
end
```
which can be executed inside a CUDA kernel (one thread per element, all within a single block, since it only uses `threadIdx().x` and `sync_threads()`).
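For reference, here is roughly how I picture launching it. This is just a sketch: `scan_kernel` is a hypothetical wrapper of mine, and I'm assuming the whole array fits in a single block.

```julia
using CUDA

# Hypothetical wrapper: one block, one thread per element of `sums`.
function scan_kernel(sums)
    cumsum!(sums)
    return
end

xs = CuArray([1, 2, 3, 4])
@cuda threads=length(xs) scan_kernel(xs)
Array(xs)  # expected: [1, 3, 6, 10]
```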
Now I need to implement it using KernelAbstractions.jl, but I'm not familiar with shared memory, especially in KernelAbstractions.jl. The most straightforward translation I could come up with is something like:
```julia
using KernelAbstractions

@kernel function cumsum!(sums)
    idx = @index(Global)
    shift = 1
    while shift < length(sums)
        to_add = 0
        @inbounds if idx - shift > 0
            to_add = sums[idx - shift]
        end
        @synchronize
        @inbounds if idx - shift > 0
            sums[idx] += to_add
        end
        @synchronize
        shift *= 2
    end
end
```
But I don't know whether this is correct, or whether I need to use `KernelAbstractions.@localmem` or something else.
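For instance, is something along these lines what I should be writing? This is only a sketch of what I imagine a `@localmem` version looks like: `cumsum_localmem!` and `GROUP_SIZE` are my own names, and I'm assuming a single workgroup that covers the whole array.

```julia
using KernelAbstractions

# Guessed scratch size; must be at least as large as the workgroup size.
const GROUP_SIZE = 256

@kernel function cumsum_localmem!(sums)
    idx = @index(Local)

    # Stage the data in workgroup-local (shared) memory.
    tmp = @localmem eltype(sums) (GROUP_SIZE,)
    @inbounds tmp[idx] = sums[idx]
    @synchronize

    shift = 1
    while shift < length(sums)
        to_add = zero(eltype(sums))
        @inbounds if idx - shift > 0
            to_add = tmp[idx - shift]
        end
        @synchronize
        @inbounds if idx - shift > 0
            tmp[idx] += to_add
        end
        @synchronize
        shift *= 2
    end

    # Copy the scanned values back to global memory.
    @inbounds sums[idx] = tmp[idx]
end

# Hypothetical launch with CUDA.jl (Metal.jl would be analogous),
# assuming the whole array fits in one workgroup:
# using CUDA
# xs = CuArray(Float32[1, 2, 3, 4])
# backend = get_backend(xs)
# cumsum_localmem!(backend, length(xs))(xs, ndrange = length(xs))
# KernelAbstractions.synchronize(backend)
# Array(xs)  # expected: [1.0, 3.0, 6.0, 10.0]
```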