Kernel for building histogram on GPU

I want to build a histogram from a large matrix of floats, where a matrix of integers of the same size identifies the bin to which each float belongs. It could also be seen as a groupby aggregation, but with a different index for each column.
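
To pin down the semantics: for every column j and bin b, the target is

hist[b, j] == sum(δ[i, j] for i in axes(idx, 1) if idx[i, j] == b)

i.e. a per-column groupby-sum where the grouping key differs for every entry.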

The scatter_add! function developed here seems like a good basis. However, when comparing the performance with a multi-threaded CPU implementation, it's fairly slow: 2 to 5 times slower.

Example:

using CUDAnative
using CuArrays

# CPU
function hist_cpu!(hist, δ, idx)
    # parallelise over columns: each thread owns whole columns of hist,
    # so the increments below never race
    Threads.@threads for j in 1:size(idx,2)
        @inbounds for i in 1:size(idx,1)
            hist[idx[i,j], j] += δ[i,j]   # bin index is per (row, column) entry
        end
    end
    return
end

# GPU
function hist_gpu!(h::CuMatrix{T}, x::CuArray{T}, id::CuArray{Int}; MAX_THREADS=256) where {T<:AbstractFloat}
    function kernel!(h::CuDeviceArray{T}, x::CuDeviceArray{T}, id)
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
        j = threadIdx().y + (blockIdx().y - 1) * blockDim().y
        @inbounds if i <= size(id, 1) && j <= size(h, 2)
            # every thread issues a global atomic add, since many (i, j)
            # pairs can hit the same bin of h
            k = Base._to_linear_index(h, id[i,j], j)
            CUDAnative.atomic_add!(pointer(h, k), x[i,j])
        end
        return
    end
    thread_i = min(MAX_THREADS, size(id, 1))
    thread_j = min(MAX_THREADS ÷ thread_i, size(h, 2))
    threads = (thread_i, thread_j)
    blocks = ceil.(Int, (size(id, 1), size(h, 2)) ./ threads)
    CUDAnative.@cuda blocks=blocks threads=threads kernel!(h, x, id)
    return h
end

Generate a toy test:

nbins = 20
ncol = 100
items = Int(1e6)
hist = zeros(Float32, nbins, ncol)
δ = rand(Float32, items, ncol)
idx = rand(1:nbins, items, ncol)

hist_gpu = CuArray(hist)
δ_gpu = CuArray(δ)
idx_gpu = CuArray(idx)

@time hist_cpu!(hist, δ, idx)
# 0.021154 seconds (64 allocations: 7.063 KiB)
CuArrays.@time hist_gpu!(hist_gpu, δ_gpu, idx_gpu, MAX_THREADS=1024)
# 0.066522 seconds (36 CPU allocations: 1.141 KiB)
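
Note that these are single-run timings and the first call to each function includes compilation. A sketch of a more careful measurement, assuming BenchmarkTools is installed and that this CuArrays version provides CuArrays.@sync (which blocks until the kernel has actually finished):

using BenchmarkTools

# warm-up plus repeated runs; without the sync, the GPU number would mostly
# measure the asynchronous kernel launch rather than the kernel itself
@btime hist_cpu!($hist, $δ, $idx)
@btime CuArrays.@sync hist_gpu!($hist_gpu, $δ_gpu, $idx_gpu, MAX_THREADS=1024)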

Are there inefficiencies in the implementation, or is this kind of operation inherently not well suited to GPU parallelisation? Or am I not performing the benchmark appropriately? The above was run on an 8-thread 3.5 GHz CPU and a GTX 1060 GPU.

I can’t add much, but a quick Google search uncovered this, which might help.

Doing global atomic additions from every thread, like that kernel does, is going to be very expensive (read: slow). The link above suggests doing so in shared memory first, which indeed is going to improve performance by a lot.
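
To make that concrete, here is a minimal sketch of the 2-step idea, not a drop-in solution: it assumes one block column per matrix column, that the nbins rows of h fit in shared memory, and that this CUDAnative version supports atomic_add! on shared-memory pointers (hist_gpu_shared! is a made-up name):

function hist_gpu_shared!(h::CuMatrix{T}, x::CuMatrix{T}, id::CuMatrix{Int};
                          MAX_THREADS=256) where {T<:AbstractFloat}
    function kernel!(h::CuDeviceArray{T}, x::CuDeviceArray{T}, id)
        nbins = size(h, 1)
        j = blockIdx().y                        # each block column handles one matrix column
        shared = @cuDynamicSharedMem(T, nbins)  # block-local histogram for column j
        # zero the shared bins (strided so it works for any blockDim)
        k = threadIdx().x
        while k <= nbins
            @inbounds shared[k] = zero(T)
            k += blockDim().x
        end
        sync_threads()
        # step 1: accumulate into shared memory, where atomics stay on-chip
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
        if i <= size(id, 1)
            @inbounds CUDAnative.atomic_add!(pointer(shared, id[i, j]), x[i, j])
        end
        sync_threads()
        # step 2: flush the block-local histogram to global memory
        k = threadIdx().x
        while k <= nbins
            @inbounds CUDAnative.atomic_add!(
                pointer(h, Base._to_linear_index(h, k, j)), shared[k])
            k += blockDim().x
        end
        return
    end
    threads = min(MAX_THREADS, size(id, 1))
    blocks = (ceil(Int, size(id, 1) / threads), size(h, 2))
    CUDAnative.@cuda blocks=blocks threads=threads shmem=size(h, 1) * sizeof(T) kernel!(h, x, id)
    return h
end

The payoff is in step 2: each block issues at most nbins global atomics instead of one per processed element, so contention on global memory drops sharply when nbins is much smaller than the block size.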

Thanks for pointing in a promising direction. I’ll roll up my sleeves and explore the 2-step aggregation mentioned in the post further. Might call for help again later :)
