Sort and collapse a vector with GPU

As my first attempt at something serious and practical to achieve with a GPU, I have a problem where I have an array of elements; each element is a tuple of (value, count_of_value), and count_of_value is initially uniformly 1:

julia> arr2
50-element Array{Tuple{Int64,Int64},1}:
 (6, 1)
 (6, 1)
 (1, 1)
 (10, 1)
 (10, 1)
 (4, 1)
 (9, 1)
 (4, 1)
 (10, 1)
 (4, 1)
 (4, 1)
 (1, 1)
 (6, 1)
 (1, 1)
 (9, 1)
 (6, 1)
 (8, 1)
 (9, 1)
 (5, 1)
 (6, 1)
 (9, 1)
 (2, 1)
 (6, 1)
 (10, 1)
 (2, 1)
 (5, 1)
 (8, 1)
 (5, 1)
 (6, 1)
 (5, 1)
 (10, 1)
 (8, 1)
 (7, 1)
 (4, 1)
 (1, 1)
 (8, 1)
 (7, 1)
 (9, 1)
 (4, 1)
 (1, 1)
 (6, 1)
 (5, 1)
 (2, 1)
 (4, 1)
 (10, 1)
 (7, 1)
 (10, 1)
 (3, 1)
 (7, 1)
 (10, 1)

If I want to count these values on the CPU, in my optimised code (not shown) I sort the vector and then collapse runs of the same value, ending up with something that looks like this:

julia> sort!(arr2)
50-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (1, 1)
 (1, 1)
 (1, 1)
 (1, 1)
 (2, 1)
 (2, 1)
 (2, 1)
 (3, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (5, 1)
 (5, 1)
 (5, 1)
 (5, 1)
 (5, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (7, 1)
 (7, 1)
 (7, 1)
 (7, 1)
 (8, 1)
 (8, 1)
 (8, 1)
 (8, 1)
 (9, 1)
 (9, 1)
 (9, 1)
 (9, 1)
 (9, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)

julia> arr3 = [(y[1], count(x -> x == y, arr2)) for y in unique(arr2)] # NOT my optimised collapsing version
10-element Array{Tuple{Int64,Int64},1}:
 (1, 5)
 (2, 3)
 (3, 1)
 (4, 7)
 (5, 5)
 (6, 8)
 (7, 4)
 (8, 4)
 (9, 5)
 (10, 8)
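
For reference, the collapse itself can be done in a single pass over the sorted vector. A rough sketch of that idea (illustrative only, not my actual optimised code) looks like this:

# single pass over the sorted vector, merging runs of equal values
function collapse(sorted::Vector{Tuple{Int64,Int64}})
    out = Tuple{Int64,Int64}[]
    for (v, c) in sorted
        if !isempty(out) && out[end][1] == v
            out[end] = (v, out[end][2] + c)   # same value as the previous run: add the counts
        else
            push!(out, (v, c))                # new value: start a new run
        end
    end
    return out
end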

I’m interested to see if this can be done with CUDA on the GPU. I’ve had a look at Kernel for building histogram on GPU - #11 by jeremiedb, but the discussion starts talking about shared dot operations and the relevance gets away from me.

CUDA.sort in the REPL returns a reference to a function so I guess there is a CuArray version of sort for the GPU?

The next part I’m unsure of is whether it’s possible to collapse an array on the GPU using a high-level function or whether a custom kernel is necessary.

There is not; it just returns Base.sort. There’s one in development though: https://github.com/JuliaGPU/CUDA.jl/pull/431

If the use case you present is representative of your real use case, then you actually don’t even need to resort to sorting. The key requirement is that the first element in your tuple represents the “bin” to which it should be assigned. If that is the case, then you can look at any tuple independently and add it to your histogram. This is AFAIK the fastest way to perform aggregation by group, such as getting the mean/sum by group of a dataframe.

By slightly tweaking the kernel I’m using in EvoTrees for your vectorized case, I get the following:

# base kernel
function kernel!(h::CuDeviceVector{T}, x::CuDeviceVector{T}, xid::CuDeviceVector{S}) where {T,S}

    nbins = size(h, 1)
    it = threadIdx().x
    id = blockDim().x

    # per-block histogram accumulated in shared memory with atomic adds
    shared = @cuDynamicSharedMem(T, nbins)
    fill!(shared, 0)
    fill!(h, 0)
    sync_threads()

    # each thread strides through the input: thread `it` handles elements it, it + id, it + 2id, ...
    i_tot = size(x, 1)
    iter = 0
    while iter * id < i_tot
        i = it + id * iter
        if i <= size(xid, 1)
            @inbounds CUDA.atomic_add!(pointer(shared, xid[i]), x[i])
        end
        iter += 1
    end
    sync_threads()
    # flush shared memory into the global histogram; loop to cover cases where nbins > nthreads
    for i in 1:(nbins - 1) ÷ id + 1
        bin_id = it + id * (i - 1)
        if bin_id <= nbins
            @inbounds CUDA.atomic_add!(pointer(h, bin_id), shared[bin_id])
        end
    end
    return nothing
end

# base approach - block built along the cols first, then the rows (to limit collisions)
function hist!(h::AbstractVector{T}, x::AbstractVector{T}, xid::AbstractVector{S}; MAX_THREADS=256) where {T,S}
    threads = min(MAX_THREADS, size(xid, 1))
    @cuda blocks = 1 threads = threads shmem = sizeof(T) * size(h,1) kernel!(h, x, xid)
    return
end

Now a test case with 10 bins. idx refers to the first element of the tuple and x to the second (I’ve filled x with ones; using an Int may be more relevant than the Float I’ve used here, since in your case it’s a count).

nbins = 10
items = Int32(1e6)
hist = zeros(Float32, nbins)
idx = Int64.(rand(1:nbins, items))
x = ones(Float32, items)

hist_gpu = CuArray(hist)
x_gpu = CuArray(x)
idx_gpu = CuArray(idx)

julia> @btime CUDA.@sync hist!($hist_gpu, $x_gpu, $idx_gpu, MAX_THREADS=1024)
  3.446 ms (18 allocations: 448 bytes)

julia> hist_gpu
10-element CuArray{Float32,1}:
  99965.0
 100414.0
 100006.0
  99928.0
  99813.0
  99883.0
  99777.0
  99992.0
 100074.0
 100148.0
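
As a quick sanity check (just an illustration, not part of the kernel above), the GPU result can be compared against a straightforward CPU accumulation over the same data:

hist_cpu = zeros(Float32, nbins)
for i in eachindex(idx)
    hist_cpu[idx[i]] += x[i]    # accumulate each value into its bin
end
Array(hist_gpu) ≈ hist_cpu      # should return true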

Note that only a single block is launched.
In my initial mental model of the GPU, I thought it was always better to launch enough blocks to cover the entire matrix of data to operate on, which has proven wrong in my limited experience. In the above, a single block is launched, and the whole vector is covered by iterating with that single block of threads. It may prove better to launch more than a single block and use a fancier iteration; I haven’t tested that yet, but a rough sketch of what I have in mind is below!
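
Something along these lines (untested), where the stride becomes blockDim().x * gridDim().x and the global histogram is zeroed on the host before launch; kernel_grid! is just an illustrative name:

# untested sketch: multi-block variant with a grid-stride loop
function kernel_grid!(h::CuDeviceVector{T}, x::CuDeviceVector{T}, xid::CuDeviceVector{S}) where {T,S}
    nbins = size(h, 1)
    it = threadIdx().x
    ib = blockIdx().x
    id = blockDim().x
    ig = gridDim().x

    # per-block shared-memory histogram
    shared = @cuDynamicSharedMem(T, nbins)
    fill!(shared, 0)
    sync_threads()

    # grid-stride loop: start at the global thread index, step by the total thread count in the grid
    i = it + id * (ib - 1)
    stride = id * ig
    while i <= size(xid, 1)
        @inbounds CUDA.atomic_add!(pointer(shared, xid[i]), x[i])
        i += stride
    end
    sync_threads()

    # each block adds its shared histogram into the global one
    for j in 1:(nbins - 1) ÷ id + 1
        bin_id = it + id * (j - 1)
        if bin_id <= nbins
            @inbounds CUDA.atomic_add!(pointer(h, bin_id), shared[bin_id])
        end
    end
    return nothing
end

# hist_gpu .= 0   # zero the global histogram first, since multiple blocks now write to it
# @cuda blocks = 4 threads = 256 shmem = sizeof(Float32) * nbins kernel_grid!(hist_gpu, x_gpu, idx_gpu)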

In my use case, the bins are indeed the first values of the tuples. But in reality, I never know in advance how many bins exist.

The sort-collapse routine is a step in counting k-mers in a longer sequence: make a vector of all the k-mers, sort, then collapse and count. This is because, whilst all k-mers are represented using Ints or UInts, for any k the number of possible values is 4^k, which gets huge for anything other than small k. Crucially, though, you observe nowhere near that full range of values in real sequences, because they’re real biological entities and are nowhere near random. So if I understand your code (and I probably don’t fully, GPU concepts are brand new to me), I would have a problem insofar as the array I’m accumulating the histogram in would need to be very big, maybe unmanageably so, and would be mostly 0s.
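
For a rough back-of-envelope sense of the scale (my own illustration, assuming one 4-byte counter per possible k-mer value):

# size of a dense histogram with one Int32 counter per possible k-mer value
for k in (7, 15, 21, 31)
    nbins = big(4)^k
    gib = Float64(nbins) * 4 / 2^30
    println("k = $k: $nbins bins ≈ $gib GiB of counters")
end

So a dense array indexed directly by the k-mer value stops being practical very quickly.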