CUDA sum kernels, threads and blocks, complex values

Hi there, I was trying to implement a custom sum-dot-product-thingy using CUDA.jl kernels when I stumbled upon this documentation PR. It stems from a discussion that took place not too long ago here.

I probably do not understand some fundamentals when it comes to CUDA kernel programming, so please bear with me. I gathered my questions here and would appreciate any feedback/guidance.

  1. Why is sum_atomic still so much slower than CUDA.sum? There is a short note at the end of the PR referring to threads accessing out, but nothing concrete. How can sum_atomic be improved? Code from the PR link:
function run(kernel, arr)
    out = CUDA.zeros(eltype(arr))
    CUDA.@sync begin
        @cuda threads=128 blocks=1024 kernel(out, arr)
    end
    out[]
end

function sum_atomic(out, arr)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    acc = zero(eltype(out))
    for i = index:stride:length(arr)
        @inbounds acc += arr[i]
    end
    @atomic out[] += acc
    return nothing
end

@btime run(sum_atomic, arr) # 266.052 μs (46 allocations: 1.41 KiB)
@btime sum(arr) # 37.511 μs (62 allocations: 1.62 KiB)
  2. Why is the number of threads specifically set to 128 and the number of blocks to 1024? Is this choice backed by some fact? I too seem to get optimal results for this combination; however, isn’t the correct way to choose numblocks = ceil(Int, N/numthreads) according to the documentation? When I do this, the performance is worse than with threads=128 and blocks=1024. And shouldn’t numthreads * numblocks > length(arr) hold? What am I missing?

  3. Also: are complex-valued CuArrays not supported yet? I get the kernel returns a value of type Union{} error when initializing out as a CuArray{ComplexF32}.


So first of all, I should mention that when I made the PR I only had maybe a dozen hours of GPU programming experience, so this code has lots of problems and is not an example of how to do things. Furthermore, it was meant as a slow baseline that should be improved over the course of the tutorial. Sadly, I have not yet found time to update the PR.

  1. To improve this code, one has to use shared memory. Hopefully I will find time to update the PR.

  2. These numbers are in no way ideal. A general recommendation from the CUDA docs, as far as I remember, is that the number of threads should satisfy 32 << threads <= 1024 (i.e. well above a single warp, at most 1024), and the number of blocks should be in the thousands.
    I think blocks is indeed too high in the example. numthreads * numblocks > length(arr) is not required: each thread can loop over multiple elements in the code. See the occupancy sketch at the end of this post for a way to let CUDA.jl suggest a configuration.

  3. I guess atomic addition (@atomic) is not supported for complex numbers. A possible workaround is sketched right below.
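
Regarding point 3: since @atomic addition does work for real floating-point element types, one possible workaround (a rough sketch, not from the PR; sum_atomic_complex and run_complex are made-up names, and the launch numbers are just copied from the original post) is to accumulate the real and imaginary parts into two separate Float32 slots and reassemble the complex result on the host:

using CUDA

# Sum a CuArray{ComplexF32} by atomically accumulating the real and
# imaginary parts into a two-element Float32 buffer.
function sum_atomic_complex(out, arr)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    acc = zero(ComplexF32)
    for i = index:stride:length(arr)
        @inbounds acc += arr[i]
    end
    # out[1] holds the real part, out[2] the imaginary part
    CUDA.@atomic out[1] += real(acc)
    CUDA.@atomic out[2] += imag(acc)
    return nothing
end

function run_complex(arr)
    out = CUDA.zeros(Float32, 2)
    CUDA.@sync begin
        @cuda threads=128 blocks=1024 sum_atomic_complex(out, arr)
    end
    h = Array(out)
    ComplexF32(h[1], h[2])
end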

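Regarding point 2: instead of hard-coding threads=128 blocks=1024, CUDA.jl can suggest a launch configuration via its occupancy API. A minimal sketch (run_auto is a made-up name; the pattern follows the CUDA.jl introduction docs) using the sum_atomic kernel from the first post:

using CUDA

function run_auto(kernel, arr)
    out = CUDA.zeros(eltype(arr))
    # Compile the kernel without launching it, then ask the occupancy API
    # for a suggested number of threads and blocks.
    k = @cuda launch=false kernel(out, arr)
    config = launch_configuration(k.fun)
    threads = min(length(arr), config.threads)
    # For a grid-stride kernel, cap the block count at what saturates the GPU.
    blocks = min(config.blocks, cld(length(arr), threads))
    CUDA.@sync k(out, arr; threads, blocks)
    out[]
end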

Here is how you can get started using shared memory:

using CUDA
function sum_block_kernel!(out, x)
    index = threadIdx().x
    stride = blockDim().x
    # @cushow index
    # @cushow stride
    # Each thread accumulates its slice of x (block-stride loop).
    acc = 0f0
    for i = index:stride:length(x)
        @inbounds acc += x[i]
    end
    # Per-thread partial sums go into dynamic shared memory.
    shmem = @cuDynamicSharedMem(Float32, blockDim().x)
    shmem[index] = acc
    sync_threads()
    # Tree reduction in shared memory; assumes blockDim().x is a power of two.
    imax = blockDim().x ÷ 2
    while imax >= 1
        # if index == 1
        #     @cushow imax
        # end
        if index <= imax
            shmem[index] += shmem[index+imax]
        end
        imax = imax ÷ 2
        sync_threads()
    end
    # Thread 1 writes the block's total.
    if index == 1
        out[] = shmem[1]
    end
    nothing
end

function sum_single_block(x)
    threads = 512
    @assert length(x) == threads
    out = CUDA.zeros()
    CUDA.@sync begin
        shmem = sizeof(eltype(x)) * threads
        k = @cuda threads=threads blocks=1 shmem=shmem sum_block_kernel!(out, x)
    end
    out[]
end

x = CUDA.randn(512)

using Test
@test isapprox(sum_single_block(x), sum(x))
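
Building on the code above, here is a rough sketch (not benchmarked; sum_grid_kernel! and sum_multi_block are made-up names, and the default block count is just a guess to be tuned per GPU) of how this could be extended to multiple blocks: each block reduces its grid-stride chunk in shared memory exactly as above, and thread 1 of each block then adds the block's partial sum to out with CUDA.@atomic.

# Multi-block version: per-block shared-memory reduction + one atomic add per block.
function sum_grid_kernel!(out, x)
    tid = threadIdx().x
    index = (blockIdx().x - 1) * blockDim().x + tid
    stride = blockDim().x * gridDim().x
    acc = 0f0
    for i = index:stride:length(x)
        @inbounds acc += x[i]
    end
    shmem = @cuDynamicSharedMem(Float32, blockDim().x)
    shmem[tid] = acc
    sync_threads()
    # Tree reduction; assumes blockDim().x is a power of two.
    imax = blockDim().x ÷ 2
    while imax >= 1
        if tid <= imax
            shmem[tid] += shmem[tid+imax]
        end
        imax = imax ÷ 2
        sync_threads()
    end
    if tid == 1
        CUDA.@atomic out[] += shmem[1]
    end
    nothing
end

function sum_multi_block(x; threads=512, blocks=128)
    out = CUDA.zeros(Float32)
    shmem = sizeof(Float32) * threads
    CUDA.@sync begin
        @cuda threads=threads blocks=blocks shmem=shmem sum_grid_kernel!(out, x)
    end
    out[]
end

x2 = CUDA.randn(100_000)
@test isapprox(sum_multi_block(x2), sum(x2))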