Kernel for building histogram on GPU

I want to build a histogram from a large matrix of floats, where a matrix of integers of the same size identifies the bin into which each of those floats must be accumulated. It could also be seen as a groupby aggregation, but with a different index for each column.

The scatter_add! function developed here seems like a good basis. However, when compared with a multi-threaded CPU implementation, it’s noticeably slower (2 to 5 times slower).

Example:

using CUDAnative
using CuArrays

# CPU
function hist_cpu!(hist, δ, idx)
    Threads.@threads for j in 1:size(idx,2)
        @inbounds for i in 1:size(idx,1)
            hist[idx[i,j], j] += δ[i,j]
        end
    end
    return
end

# GPU
function hist_gpu!(h::CuMatrix{T}, x::CuArray{T}, id::CuArray{Int}; MAX_THREADS=256) where {T<:AbstractFloat}
    function kernel!(h::CuDeviceArray{T}, x::CuDeviceArray{T}, id)
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
        j = threadIdx().y + (blockIdx().y - 1) * blockDim().y
        @inbounds if i <= size(id, 1) && j <= size(h, 2)
            k = Base._to_linear_index(h, id[i,j], j)
            CUDAnative.atomic_add!(pointer(h, k), x[i,j])
        end
        return
    end
    thread_i = min(MAX_THREADS, size(id, 1))
    thread_j = min(MAX_THREADS ÷ thread_i, size(h, 2))
    threads = (thread_i, thread_j)
    blocks = ceil.(Int, (size(id, 1), size(h, 2)) ./ threads)
    CuArrays.@cuda blocks=blocks threads=threads kernel!(h, x, id)
    return h
end
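As a side note on the launch configuration (my own check, not part of the kernel): for the toy sizes used below, the arithmetic can be reproduced on the CPU, and it shows that with MAX_THREADS = 1024 the column dimension collapses to a single thread per block:

```julia
# Reproducing the launch-configuration arithmetic for the toy problem
# (10^6 rows, 100 columns, MAX_THREADS = 1024) — plain CPU arithmetic.
MAX_THREADS = 1024
nrows, ncols = 1_000_000, 100

thread_i = min(MAX_THREADS, nrows)             # 1024
thread_j = min(MAX_THREADS ÷ thread_i, ncols)  # 1: no thread-level parallelism across columns
blocks = ceil.(Int, (nrows, ncols) ./ (thread_i, thread_j))  # (977, 100)
```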

Generate toy test:

nbins = 20
ncol = 100
items = Int(1e6)
hist = zeros(Float32, nbins, ncol)
δ = rand(Float32, items, ncol)
idx = rand(1:nbins, items, ncol)

hist_gpu = CuArray(hist)
δ_gpu = CuArray(δ)
idx_gpu = CuArray(idx)

@time hist_cpu!(hist, δ, idx)
# 0.021154 seconds (64 allocations: 7.063 KiB)
@CuArrays.time hist_gpu!(hist_gpu, δ_gpu, idx_gpu, MAX_THREADS=1024)
# 0.066522 seconds (36 CPU allocations: 1.141 KiB)

Are there inefficiencies in the implementation, or is this kind of operation inherently not well suited to GPU parallelisation? Or am I not performing the benchmark appropriately? The above was run on an 8-thread 3.5 GHz CPU and a 1060 GPU.

I can’t add much, but a quick google uncovered this, which might help


Doing global atomic additions from every thread, like that kernel does, is going to be very expensive (read: slow). The link above suggests doing so in shared memory first, which indeed is going to improve performance by a lot.
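To make the idea concrete, here is a minimal CPU sketch (my own illustration, not CUDA code) of the two-step aggregation: each “block” accumulates into its own private histogram, and the partials are merged afterwards, so the contended additions only happen in the small merge step:

```julia
# CPU sketch of two-step aggregation: per-"block" private histograms
# followed by a merge. On the GPU, the private histogram would live in
# shared memory and the merge would use atomic adds on global memory.
function hist_twostep(δ::Vector{T}, idx::Vector{Int}, nbins, nchunks) where {T}
    partials = zeros(T, nbins, nchunks)
    chunklen = cld(length(δ), nchunks)
    for c in 1:nchunks                       # each "block"
        lo = (c - 1) * chunklen + 1
        hi = min(c * chunklen, length(δ))
        for i in lo:hi
            partials[idx[i], c] += δ[i]      # uncontended, block-private
        end
    end
    return vec(sum(partials, dims = 2))      # merge step
end

δ = rand(Float32, 1000)
idx = rand(1:20, 1000)
hist_twostep(δ, idx, 20, 8)
```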

Thanks for pointing me in a promising direction. I’ll roll up my sleeves and explore the 2-step aggregation mentioned in the post further. Might call again for help later :)


@maleadt Initial explorations of shared-memory aggregation resulted in poor performance relative to the naive atomic_add! on global memory. I followed the example presented on slide 16 as a toy case, which is a simple dot product.

Below, the atomic_add! version performs roughly 2x faster than the shared-memory implementation, and this holds for vectors of length 100k up to 10M.

Initial setup:

using CUDAnative
using CuArrays
using BenchmarkTools
N1 = 10_000_000
x1 = rand(N1)
x2 = rand(N1)
x1g = CuArray(x1)
x2g = CuArray(x2)

Atomic_add! on global memory:


function dot_atomic!(x, y, z)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(x)
        CUDAnative.atomic_add!(pointer(z,1), x[idx]*y[idx])
    end
    return nothing
end

function bench_dot_atomic!(x, y, z)
    numblocks = ceil(Int, N1/gthreads)
    CuArrays.@sync begin
        @cuda threads=gthreads blocks=numblocks dot_atomic!(x, y, z)
    end
end

gthreads = 512
numblocks = ceil(Int, N1/gthreads)
z0 = CuArray(zeros(1))
@cuda threads=gthreads blocks=numblocks dot_atomic!(x1g, x2g, z0)

@btime bench_dot_atomic!($x1g, $x2g, $z0)
17.236 ms (73 allocations: 2.14 KiB)

Shared Memory Implementation:

function dot_share!(x::CuDeviceVector{T}, y::CuDeviceVector{T}, z::CuDeviceVector{T}) where {T}

    tid = threadIdx().x
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    shared = CUDAnative.@cuStaticSharedMem(T, 512)
    fill!(shared, 0)

    if idx <= length(x)
        shared[tid] = x[idx] * y[idx]
    end
    sync_threads()

    i = blockDim().x / 2
    while i != 0
        if tid <= i
            @inbounds shared[tid] += shared[Cuint(tid+i)]
        end
        sync_threads()
        i ÷= 2
    end
    if tid == 1
        z[blockIdx().x] = shared[1]
    end
    return nothing
end

function bench_dot_share!(x, y, z)
    numblocks = ceil(Int, N1/gthreads)
    CuArrays.@sync begin
        @cuda threads=gthreads blocks=numblocks dot_share!(x, y, z)
    end
end

gthreads = 512
numblocks = ceil(Int, N1/gthreads)
z0 = CuArray(zeros(numblocks))
@cuda threads=gthreads blocks=numblocks dot_share!(x1g, x2g, z0)

@btime bench_dot_share!($x1g, $x2g, $z0)
38.025 ms (73 allocations: 2.14 KiB)

Was I mistaken thinking it would be the kind of scenario where shared memory would be useful? Or are there inefficiencies in its implementation?

Also, following the NVIDIA example provided by @ianshmean, I thought an atomic add on the shared memory would work in the aggregation loop: CUDAnative.atomic_add!(pointer(shared, tid), shared[Cuint(tid+i)]), but it resulted in CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS).

I’m not sure whether it’s tied to the above error, but is it normal that shared memory behaves as if it were implicitly duplicated in each block? For example, in z[blockIdx().x] = shared[1], there’s no explicit mention that shared[1] belongs to the associated blockIdx().x, although my understanding is that there is a distinct shared memory per block. Any clarification on that would be most enlightening!


You should use @inbounds if you know it to be safe – bounds checking is very expensive.

You are probably just going out of bounds here; atomic operations are currently not bounds checked.

Yes, shared memory is only shared within a single thread block as per the CUDA programming model. If you need to share data across blocks, you need to use global memory (but there’s no way to synchronize between them, so you need multiple kernels in that case).
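A minimal CPU stand-in for that pattern (an illustration, not the actual kernels): each block writes its partial result to global memory, and a second pass finishes the reduction, since blocks cannot synchronize with each other:

```julia
# Simulate per-block partial sums written to "global memory" (z), followed
# by a second reduction pass — the two-kernel structure described above.
x = rand(Float32, 10_000)
y = rand(Float32, 10_000)
blocksize = 512
nblocks = cld(length(x), blocksize)

z = zeros(Float32, nblocks)                    # one slot per block
for b in 1:nblocks                             # kernel 1: block-local reduction
    lo = (b - 1) * blocksize + 1
    hi = min(b * blocksize, length(x))
    z[b] = sum(@view(x[lo:hi]) .* @view(y[lo:hi]))
end
result = sum(z)                                # kernel 2 (or host): final reduce
```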


Thanks for the tips. Adding @inbounds, fixing the type stability of i through ÷, and reducing the number of threads per block resulted in a 4x speedup compared to the atomic_add! version.

I remain perplexed, however, that the operation @inbounds shared[tid] += shared[tid+i] works but the following does not: CUDAnative.atomic_add!(pointer(shared, tid), shared[tid+i]). I understand that the error message points to an out-of-bounds issue, yet the first expression appears to refer to exactly the same elements, so I’m confused. Isn’t pointer(shared, tid) in atomic_add! equivalent to shared[tid]?
Running the following isolated case also crashes:

using CUDAnative
shared = CUDAnative.@cuStaticSharedMem(Float32, 8)
fill!(shared, 0)
CUDAnative.atomic_add!(pointer(shared, 1), 2.2f0)
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x0 -- unknown function (ip: 0000000000000000)
in expression starting at none:0
unknown function (ip: 0000000000000000)
_ZN4llvm12SelectionDAG24AddModifiedNodeToCSEMapsEPNS_6SDNodeE at C:\Users\Jeremie\AppData\Local\Programs\Julia\Julia-1.4.1\bin\LLVM.dll (unknown line)
_ZN4llvm12SelectionDAG18ReplaceAllUsesWithENS_7SDValueES1_ at C:\Users\Jeremie\AppData\Local\Programs\Julia\Julia-1.4.1\bin\LLVM.dll (unknown line)
_ZN4llvm12SelectionDAG18ReplaceAllUsesWithEPNS_6SDNodeEPKNS_7SDValueE at C:\Users\Jeremie\AppData\Local\Programs\Julia\Julia-1.4.1\bin\LLVM.dll (unknown line)
...

Improved kernel for dot:

function dot_share!(x::CuDeviceVector{T}, y::CuDeviceVector{T}, z::CuDeviceVector{T}) where {T}

    tid = threadIdx().x
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    shared = CUDAnative.@cuStaticSharedMem(T, 64)
    fill!(shared, 0)
    sync_threads()

    if idx <= length(x)
        @inbounds shared[tid] = x[idx] * y[idx]
    end
    sync_threads()

    i = blockDim().x ÷ 2
    while i > 0
        if tid <= i
            @inbounds shared[tid] += shared[tid+i] # valid non atomic operation
            # CUDAnative.atomic_add!(pointer(shared, tid), shared[tid+i]) # invalid op - results in error
        end
        sync_threads()
        i ÷= 2
    end
    if tid == 1
        CUDAnative.atomic_add!(pointer(z,1), shared[1])
    end
    return nothing
end
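As a sanity check (my own sketch), the halving tree reduction from the loop above can be run sequentially on the CPU; the sync_threads() calls disappear because the iterations that run in parallel on the GPU run one after another here:

```julia
# CPU check of the tree reduction used in dot_share!: repeatedly fold the
# upper half of the buffer onto the lower half until one element remains.
function tree_reduce!(shared::Vector{T}) where {T}
    i = length(shared) ÷ 2
    while i > 0
        for tid in 1:i               # these iterations run in parallel on the GPU
            shared[tid] += shared[tid + i]
        end
        i ÷= 2
    end
    return shared[1]
end

tree_reduce!(collect(Float32, 1:64))  # sum(1:64) == 2080.0f0
```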

You can’t execute GPU code on the CPU, that will crash for sure.

You’re right though that there is something wrong with atomics and shared memory, I only tested them on global memory. I’ll file a bug.


Should be fixed with https://github.com/JuliaGPU/CUDAnative.jl/pull/642


Thanks for the quick fix, very appreciated!