GPU Sort Function

bmit · March 16, 2020, 10:12pm

I was wondering if anyone has implemented a sort or partialsort function that is GPU friendly. Ultimately I need to compute statistical metrics of a dataset such as median or quartile that require sorting.

I saw it was mentioned previously, but hadn’t seen anything shared since.

maleadt · March 17, 2020, 8:10am

Not that I know. Also see Wrapping Thrust for sorting · Issue #107 · JuliaGPU/CUDA.jl · GitHub

bmit · March 27, 2020, 9:26pm

I put together a naive GPU sorting algorithm with CuArrays that uses the even/odd technique.

I’m getting a scalar operation error for the comparison x[2] < x[1]. It seems like it should be okay to me since it is inside the function that I am mapping across.

Is there a better way to implement this simple algorithm with CuArrays?

using CuArrays
CuArrays.allowscalar(false)

function evenoddsort!(x::CuArray)

	N = length(x)
	steps = ceil(Integer, N / 2)
	odds = [view(x, j:j+1) for j in 1:2:N-1]
	evens = [view(x, j:j+1) for j in 2:2:N-1]

	for i = 1:steps
		# odd
		map(_swap2!, odds)
		# even
		map(_swap2!, evens)
	end
	return nothing
end

# this unsafe function assumes that length(x) == 2
function _swap2!(x::AbstractVector)
	if x[2] < x[1]
		reverse!(x)
	end
	return nothing
end

a=cu(collect(5:-1:1));
evenoddsort!(a)

dpsanders · March 28, 2020, 12:32am

What happens if you replace reverse!(x) with explicit operations?

dpsanders · March 28, 2020, 12:33am

In fact, what is the error message?

bmit · March 29, 2020, 5:27pm

The error I see is ERROR: scalar getindex is disallowed in the _swap2! function on the line if x[2] < x[1].

I’m using reverse! to avoid getting the same error on that line as well. If I replace it with the explicit call x[1:2] = [x[2], x[1]] I see the same error on that line.

dpsanders · March 30, 2020, 1:44pm

Thinking about it, I don’t think you can use array comprehensions on the GPU? Try using broadcasting instead.

bmit · March 30, 2020, 3:29pm

You’re right. The array comprehension is creating a CPU array (of GPU arrays). If I replace my problematic _swap2! function with a simple reverse! it runs, but the performance of evenoddsort! is terrible because of this.

It wasn’t obvious to me how I could accomplish this using broadcasting since the algorithm is pairwise. Would you be able to elaborate? I’m still getting the ropes here. Thanks again for the help!

dpsanders · March 30, 2020, 3:45pm

You can create all the views at once using broadcasting and then index using 1:2:end I think.

bmit · March 30, 2020, 6:10pm

I think there’s some progress here, but it’s still not quite working. To use your suggestion I had to change my _swap2! function to take 2 arguments instead of 1. Before I was creating adjacent, overlapping, pairs through the array comprehension calls. The new change created two new problems:

The inputs to _swap2 are now scalars so the changes are not being propagated outside the function - view seems to drop singleton dimensions and Julia inplace functions don’t seem to work on single values.
I’m now getting same ERROR: scalar getindex is disallowed but on the map call itself. This was what led me to that unsafe single array interface to _swap2! originally.

function evenoddsort!(x::AbstractArray)

	N = length(x)
	steps = ceil(Integer, N / 2)
	odds = view(x, 1:2:N)
	evens = view(x, 2:2:N)

	for i = 1:steps
		# odd
		map(_swap2!, odds[1:floor(Integer, N / 2)], evens)  # if even, this is okay, if odd drop first odd
		# even
		map(_swap2!, evens[1:(steps-1)], odds[2:end])
	end
	return nothing
end

function _swap2!(a, b)
	if a > b
		a, b = b, a
	end
	return nothing
end

dpsanders · March 31, 2020, 1:02am

I think you are still creating objects on the CPU. You need everything on the GPU.
I meant something like

julia> N = 10;

julia> v = CuArray(1:N);

julia> pairview(v, j) = view(v, j:j+1)
pairview (generic function with 1 method)

julia> views = pairview.(Ref(v), 1:N-1)
9-element Array{CuArray{Int64,1},1}:
 [1, 2]
 [2, 3]
 [3, 4]
 [4, 5]
 [5, 6]
 [6, 7]
 [7, 8]
 [8, 9]
 [9, 10]

Note that views is an array on the GPU.

dpsanders · March 31, 2020, 1:03am

And now you can do

julia> reverse!.(views)
9-element Array{CuArray{Int64,1},1}:
 [2, 3]
 [3, 4]
 [4, 5]
 [5, 6]
 [6, 7]
 [7, 8]
 [8, 9]
 [9, 10]
 [10, 1]

dpsanders · March 31, 2020, 1:04am

Hmm, that didn’t work…

dpsanders · March 31, 2020, 1:04am

Aghhh I’m wrong, views is not on the GPU, sorry.

dpsanders · March 31, 2020, 1:05am

Probably you can’t use view on the GPU. You’ll have to do the “pointer arithmetic” by hand somehow.
It may be that you just need to write your own kernel for this. ˜
You could try using KernelAbstractions.jl

maleadt · March 31, 2020, 7:15am

You can:

julia> using CuArrays, CUDAnative

julia> function kernel(a)
       b = view(a, 2:3)
       b[1] = 1
       return
       end
kernel (generic function with 1 method)

julia> a = CuArrays.zeros(Int, 4)
4-element CuArray{Int64,1,Nothing}:
 0
 0
 0
 0

julia> @cuda kernel(a)

julia> a
4-element CuArray{Int64,1,Nothing}:
 0
 1
 0
 0

A sort function that launches several kernels to swap items at every step is likely not going to perform well though. You probably need to implement a custom kernel for this.

bmit · March 31, 2020, 2:21pm

The following code based on your suggestion does work. It’s just real slow.

While view does work on the GPU in general, it doesn’t seem to in this case. The broadcast call oddview = pairview.(Ref(x), 1:2:N-1) doesn’t seem to keep the pointer data on the GPU, which makes the code very inefficient.

pairview(v, j) = view(v, j:j+1)

function evenoddsort!(x::AbstractArray)

	N = length(x)
	steps = ceil(Integer, N / 2)
	oddview = pairview.(Ref(x), 1:2:N-1)
	evenview = pairview.(Ref(x), 2:2:N-1)

	for i = 1:steps
		# odd
		reverse!.(oddview)
		# even
		reverse!.(evenview)
	end
	return nothing
end

bmit · March 31, 2020, 2:36pm

That’s what I was afraid of. I was hoping there was an easy way to implement this using standard Julia machinery, but I suppose that was a bit naive - if it were easy someone else probably would have done it already.

Maybe the next time I have some free cycles, I’ll try and work through the documentation on how to optimize custom kernels. Thanks for trying to help sort through this, everyone!

bmit · March 31, 2020, 8:36pm

@maleadt What is your preferred way for us in the community to upvote feature requests for the Julia/GPU ecosystem? My applications would really benefit from being able to call sort! or partialsort! on a CuArray.

It seems like the parallel quicksort algorithms are a much better choice than the even-odd technique I’m attempting above. Base already uses a sequential quicksort algorithms as the default. The following links may be helpful:

https://onlinelibrary.wiley.com/doi/full/10.1002/cpe.3611

After looking into this for a bit, this looks like a pretty big task. I won’t pretend I have the skill or the bandwidth to get this done in the immediate future. I’ll try and keep learning in my free time, though.

maleadt · April 2, 2020, 6:33am

Best open an issue on CuArrays.jl, those are easier to keep track of.

A basic sorting implementation is fairly easy, e.g., limited to pow2-sizes inputs and only sorting a single dimension. The complexity is there when wanting to support all inputs and the full extent of the sorting interface. Maybe we should wrap some external libraries, like Thrust, for this. But that requires some complexity in the binary building department, so isn’t ideal either.