Sort and collapse a vector with GPU

As my first attempt at something serious and practical to achieve with a GPU, I have a problem where I have an array of elements; each element is a tuple of (value, count_of_value), and count_of_value is initially uniformly 1:

julia> arr2
50-element Array{Tuple{Int64,Int64},1}:
 (6, 1)
 (6, 1)
 (1, 1)
 (10, 1)
 (10, 1)
 (4, 1)
 (9, 1)
 (4, 1)
 (10, 1)
 (4, 1)
 (4, 1)
 (1, 1)
 (6, 1)
 (1, 1)
 (9, 1)
 (6, 1)
 (8, 1)
 (9, 1)
 (5, 1)
 (6, 1)
 (9, 1)
 (2, 1)
 (6, 1)
 (10, 1)
 (2, 1)
 (5, 1)
 (8, 1)
 (5, 1)
 (6, 1)
 (5, 1)
 (10, 1)
 (8, 1)
 (7, 1)
 (4, 1)
 (1, 1)
 (8, 1)
 (7, 1)
 (9, 1)
 (4, 1)
 (1, 1)
 (6, 1)
 (5, 1)
 (2, 1)
 (4, 1)
 (10, 1)
 (7, 1)
 (10, 1)
 (3, 1)
 (7, 1)
 (10, 1)

If I want to count these values on the CPU, in my optimised code (not shown) I sort the vector and then collapse runs of the same value, ending up with something that looks like this:

julia> sort!(arr2)
50-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (1, 1)
 (1, 1)
 (1, 1)
 (1, 1)
 (2, 1)
 (2, 1)
 (2, 1)
 (3, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (4, 1)
 (5, 1)
 (5, 1)
 (5, 1)
 (5, 1)
 (5, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (6, 1)
 (7, 1)
 (7, 1)
 (7, 1)
 (7, 1)
 (8, 1)
 (8, 1)
 (8, 1)
 (8, 1)
 (9, 1)
 (9, 1)
 (9, 1)
 (9, 1)
 (9, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)
 (10, 1)

julia> arr3 = [(y[1], count(x -> x == y, arr2)) for y in unique(arr2)] # NOT my optimised collapsing version
10-element Array{Tuple{Int64,Int64},1}:
 (1, 5)
 (2, 3)
 (3, 1)
 (4, 7)
 (5, 5)
 (6, 8)
 (7, 4)
 (8, 4)
 (9, 5)
 (10, 8)
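
For reference, the collapse itself can be done in a single pass over the sorted vector. A rough sketch of that idea (illustrative only, not my actual optimised code) looks like this:

# single pass over the sorted vector, merging runs of equal values
function collapse(sorted::Vector{Tuple{Int64,Int64}})
    out = Tuple{Int64,Int64}[]
    for (v, c) in sorted
        if !isempty(out) && out[end][1] == v
            out[end] = (v, out[end][2] + c)   # same value as the previous run: add the counts
        else
            push!(out, (v, c))                # new value: start a new run
        end
    end
    return out
end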

I’m interested to see if this can be done with CUDA on the GPU. I’ve had a look at Kernel for building histogram on GPU - #11 by jeremiedb, but the discussion starts talking about shared dot operations and the relevance gets away from me.

CUDA.sort in the REPL returns a reference to a function so I guess there is a CuArray version of sort for the GPU?

The next part I’m unsure of is whether it’s possible to collapse an array on the GPU using a high-level function or whether a custom kernel is necessary.

There is not; it just returns Base.sort. There’s one in development though: https://github.com/JuliaGPU/CUDA.jl/pull/431

If the use case you present is representative of your real use case, then you actually don’t even need to resort to sorting. The key requirement is that the first element in your tuple represents the “bin” to which it should be assigned. If that is the case, then you can look at any tuple independently and add it to your histogram. This is AFAIK the fastest way to perform aggregation by group, such as getting the mean/sum by group of a dataframe.

By slightly tweaking the kernel I’m using in EvoTrees for your vectorized case, I get the following:

# base kernel
function kernel!(h::CuDeviceVector{T}, x::CuDeviceVector{T}, xid::CuDeviceVector{S}) where {T,S}

    nbins = size(h, 1)
    it = threadIdx().x
    id = blockDim().x

    # per-block histogram accumulated in shared memory with atomic adds
    shared = @cuDynamicSharedMem(T, nbins)
    fill!(shared, 0)
    fill!(h, 0)
    sync_threads()

    # each thread strides through the input: thread `it` handles elements it, it + id, it + 2id, ...
    i_tot = size(x, 1)
    iter = 0
    while iter * id < i_tot
        i = it + id * iter
        if i <= size(xid, 1)
            @inbounds CUDA.atomic_add!(pointer(shared, xid[i]), x[i])
        end
        iter += 1
    end
    sync_threads()
    # flush shared memory into the global histogram; loop to cover cases where nbins > nthreads
    for i in 1:(nbins - 1) ÷ id + 1
        bin_id = it + id * (i - 1)
        if bin_id <= nbins
            @inbounds CUDA.atomic_add!(pointer(h, bin_id), shared[bin_id])
        end
    end
    return nothing
end

# base approach - block built along the cols first, then the rows (to limit collisions)
function hist!(h::AbstractVector{T}, x::AbstractVector{T}, xid::AbstractVector{S}; MAX_THREADS=256) where {T,S}
    threads = min(MAX_THREADS, size(xid, 1))
    @cuda blocks = 1 threads = threads shmem = sizeof(T) * size(h,1) kernel!(h, x, xid)
    return
end

Now a test case with 10 bins. idx refers to the first element of the tuple and x to the second (I’ve filled x with ones; using an Int may be more relevant than the Float I’ve used here, since in your case it’s a count).

nbins = 10
items = Int32(1e6)
hist = zeros(Float32, nbins)
idx = Int64.(rand(1:nbins, items))
x = ones(Float32, items)

hist_gpu = CuArray(hist)
x_gpu = CuArray(x)
idx_gpu = CuArray(idx)

julia> @btime CUDA.@sync hist!($hist_gpu, $x_gpu, $idx_gpu, MAX_THREADS=1024)
  3.446 ms (18 allocations: 448 bytes)

julia> hist_gpu
10-element CuArray{Float32,1}:
  99965.0
 100414.0
 100006.0
  99928.0
  99813.0
  99883.0
  99777.0
  99992.0
 100074.0
 100148.0
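
As a quick sanity check (just an illustration, not part of the kernel above), the GPU result can be compared against a straightforward CPU accumulation over the same data:

hist_cpu = zeros(Float32, nbins)
for i in eachindex(idx)
    hist_cpu[idx[i]] += x[i]    # accumulate each value into its bin
end
Array(hist_gpu) ≈ hist_cpu      # should return true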

Note that only a single block is launched.
In my initial mental model of the GPU, I thought it was always better to launch enough blocks to cover the entire matrix of data to operate on, which has proven wrong in my limited experience. In the above, a single block is launched, and the whole vector is covered by iterating with that single block of threads. It may prove better to launch more than a single block and use a fancier iteration; I haven’t tested that yet, but a rough sketch of what I have in mind is below!
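
Something along these lines (untested), where the stride becomes blockDim().x * gridDim().x and the global histogram is zeroed on the host before launch; kernel_grid! is just an illustrative name:

# untested sketch: multi-block variant with a grid-stride loop
function kernel_grid!(h::CuDeviceVector{T}, x::CuDeviceVector{T}, xid::CuDeviceVector{S}) where {T,S}
    nbins = size(h, 1)
    it = threadIdx().x
    ib = blockIdx().x
    id = blockDim().x
    ig = gridDim().x

    # per-block shared-memory histogram
    shared = @cuDynamicSharedMem(T, nbins)
    fill!(shared, 0)
    sync_threads()

    # grid-stride loop: start at the global thread index, step by the total thread count in the grid
    i = it + id * (ib - 1)
    stride = id * ig
    while i <= size(xid, 1)
        @inbounds CUDA.atomic_add!(pointer(shared, xid[i]), x[i])
        i += stride
    end
    sync_threads()

    # each block adds its shared histogram into the global one
    for j in 1:(nbins - 1) ÷ id + 1
        bin_id = it + id * (j - 1)
        if bin_id <= nbins
            @inbounds CUDA.atomic_add!(pointer(h, bin_id), shared[bin_id])
        end
    end
    return nothing
end

# hist_gpu .= 0   # zero the global histogram first, since multiple blocks now write to it
# @cuda blocks = 4 threads = 256 shmem = sizeof(Float32) * nbins kernel_grid!(hist_gpu, x_gpu, idx_gpu)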

In my use case, the bins are indeed the first values of the tuples. But in reality, I never know in advance how many bins exist.

The sort-collapse routine is a step in counting k-mers in a longer sequence: make a vector of all the k-mers, sort, then collapse and count. This is because, whilst all k-mers are represented using Ints or UInts, for any k the number of possible values is 4^k, which gets huge for anything other than small k. Crucially, though, you observe nowhere near that full range of values in real sequences, because they’re real biological entities and are nowhere near random. So if I understand your code (and I probably don’t fully, GPU concepts are brand new to me), I would have a problem insofar as the array I’m accumulating the histogram in would need to be very big, maybe unmanageably so, and would be mostly 0s.
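
For a rough back-of-envelope sense of the scale (my own illustration, assuming one 4-byte counter per possible k-mer value):

# size of a dense histogram with one Int32 counter per possible k-mer value
for k in (7, 15, 21, 31)
    nbins = big(4)^k
    gib = Float64(nbins) * 4 / 2^30
    println("k = $k: $nbins bins ≈ $gib GiB of counters")
end

So a dense array indexed directly by the k-mer value stops being practical very quickly.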