CUDA sum kernels, threads and blocks, complex values

Hi there, I was trying to implement a custom sum-dot-product-thingy using CUDA.jl kernels when I stumbled upon this documentation PR. It stems from a discussion that took place not too long ago here.

I probably do not understand some fundamentals when it comes to CUDA kernel programming, so please bear with me. I gathered my questions here and would appreciate any feedback/guidance.

  1. Why is sum_atomic still so much slower than CUDA.sum? There is a short note at the end of the PR referring to threads accessing out, but nothing concrete. How can sum_atomic be improved? Code from the PR link:
function run(kernel, arr)
    out = CUDA.zeros(eltype(arr))
    CUDA.@sync begin
        @cuda threads=128 blocks=1024 kernel(out, arr)
    end
    out[]
end

function sum_atomic(out, arr)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    acc = zero(eltype(out))
    for i = index:stride:length(arr)
        @inbounds acc += arr[i]
    end
    @atomic out[] += acc
    return nothing
end

@btime run(sum_atomic, arr) # 266.052 μs (46 allocations: 1.41 KiB)
@btime sum(arr) # 37.511 μs (62 allocations: 1.62 KiB)
  2. Why is the number of threads specifically set to 128 and the number of blocks to 1024? Is this choice backed by some fact? I too seem to get optimal results for this combination; however, isn’t the correct way to choose numblocks = ceil(Int, N/numthreads) according to the documentation? When I do this, the performance is worse than with threads=128 and blocks=1024. And shouldn’t numthreads * numblocks > length(arr) hold? What am I missing?

  3. Also: are complex-valued CuArrays not supported yet? I get the kernel returns a value of type Union{} error when initializing out as a CuArray{ComplexF32}.


So first of all, I should mention that when I made the PR I only had maybe a dozen hours of GPU programming experience, so this code has lots of problems and is not an example of how to do things. Furthermore, it was meant as a slow baseline that should be improved over the course of the tutorial. Sadly, I have not yet found time to update the PR.

  1. To improve this code, one has to use shared memory. Hopefully I will find time to update the PR.

  2. These numbers are in no way ideal. A general recommendation from the CUDA docs, as far as I remember, is that the number of threads should satisfy 32 << threads <= 1024 (i.e. well above a single warp, at most 1024), and the number of blocks should be in the thousands.
    I think blocks is indeed too high in the example. numthreads * numblocks > length(arr) is not required: each thread can loop over multiple elements in the code. See the occupancy sketch at the end of this post for a way to let CUDA.jl suggest a configuration.

  3. I guess atomic addition (@atomic) is not supported for complex numbers. A possible workaround is sketched right below.
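
Regarding point 3: since @atomic addition does work for real floating-point element types, one possible workaround (a rough sketch, not from the PR; sum_atomic_complex and run_complex are made-up names, and the launch numbers are just copied from the original post) is to accumulate the real and imaginary parts into two separate Float32 slots and reassemble the complex result on the host:

using CUDA

# Sum a CuArray{ComplexF32} by atomically accumulating the real and
# imaginary parts into a two-element Float32 buffer.
function sum_atomic_complex(out, arr)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    acc = zero(ComplexF32)
    for i = index:stride:length(arr)
        @inbounds acc += arr[i]
    end
    # out[1] holds the real part, out[2] the imaginary part
    CUDA.@atomic out[1] += real(acc)
    CUDA.@atomic out[2] += imag(acc)
    return nothing
end

function run_complex(arr)
    out = CUDA.zeros(Float32, 2)
    CUDA.@sync begin
        @cuda threads=128 blocks=1024 sum_atomic_complex(out, arr)
    end
    h = Array(out)
    ComplexF32(h[1], h[2])
end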

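Regarding point 2: instead of hard-coding threads=128 blocks=1024, CUDA.jl can suggest a launch configuration via its occupancy API. A minimal sketch (run_auto is a made-up name; the pattern follows the CUDA.jl introduction docs) using the sum_atomic kernel from the first post:

using CUDA

function run_auto(kernel, arr)
    out = CUDA.zeros(eltype(arr))
    # Compile the kernel without launching it, then ask the occupancy API
    # for a suggested number of threads and blocks.
    k = @cuda launch=false kernel(out, arr)
    config = launch_configuration(k.fun)
    threads = min(length(arr), config.threads)
    # For a grid-stride kernel, cap the block count at what saturates the GPU.
    blocks = min(config.blocks, cld(length(arr), threads))
    CUDA.@sync k(out, arr; threads, blocks)
    out[]
end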

Here is how you can get started using shared memory:

using CUDA
function sum_block_kernel!(out, x)
    index = threadIdx().x
    stride = blockDim().x
    # @cushow index
    # @cushow stride
    # Each thread accumulates its slice of x (block-stride loop).
    acc = 0f0
    for i = index:stride:length(x)
        @inbounds acc += x[i]
    end
    # Per-thread partial sums go into dynamic shared memory.
    shmem = @cuDynamicSharedMem(Float32, blockDim().x)
    shmem[index] = acc
    sync_threads()
    # Tree reduction in shared memory; assumes blockDim().x is a power of two.
    imax = blockDim().x ÷ 2
    while imax >= 1
        # if index == 1
        #     @cushow imax
        # end
        if index <= imax
            shmem[index] += shmem[index+imax]
        end
        imax = imax ÷ 2
        sync_threads()
    end
    # Thread 1 writes the block's total.
    if index == 1
        out[] = shmem[1]
    end
    nothing
end

function sum_single_block(x)
    threads = 512
    @assert length(x) == threads
    out = CUDA.zeros()
    CUDA.@sync begin
        shmem = sizeof(eltype(x)) * threads
        k = @cuda threads=threads blocks=1 shmem=shmem sum_block_kernel!(out, x)
    end
    out[]
end

x = CUDA.randn(512)

using Test
@test isapprox(sum_single_block(x), sum(x))
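
Building on the code above, here is a rough sketch (not benchmarked; sum_grid_kernel! and sum_multi_block are made-up names, and the default block count is just a guess to be tuned per GPU) of how this could be extended to multiple blocks: each block reduces its grid-stride chunk in shared memory exactly as above, and thread 1 of each block then adds the block's partial sum to out with CUDA.@atomic.

# Multi-block version: per-block shared-memory reduction + one atomic add per block.
function sum_grid_kernel!(out, x)
    tid = threadIdx().x
    index = (blockIdx().x - 1) * blockDim().x + tid
    stride = blockDim().x * gridDim().x
    acc = 0f0
    for i = index:stride:length(x)
        @inbounds acc += x[i]
    end
    shmem = @cuDynamicSharedMem(Float32, blockDim().x)
    shmem[tid] = acc
    sync_threads()
    # Tree reduction; assumes blockDim().x is a power of two.
    imax = blockDim().x ÷ 2
    while imax >= 1
        if tid <= imax
            shmem[tid] += shmem[tid+imax]
        end
        imax = imax ÷ 2
        sync_threads()
    end
    if tid == 1
        CUDA.@atomic out[] += shmem[1]
    end
    nothing
end

function sum_multi_block(x; threads=512, blocks=128)
    out = CUDA.zeros(Float32)
    shmem = sizeof(Float32) * threads
    CUDA.@sync begin
        @cuda threads=threads blocks=blocks shmem=shmem sum_grid_kernel!(out, x)
    end
    out[]
end

x2 = CUDA.randn(100_000)
@test isapprox(sum_multi_block(x2), sum(x2))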