I was wondering if anyone has implemented a sort or partialsort function that is GPU friendly. Ultimately I need to compute statistical metrics of a dataset such as median or quartile that require sorting.
I saw it was mentioned previously, but hadn’t seen anything shared since.
I put together a naive GPU sorting algorithm with CuArrays that uses the even/odd technique.
I’m getting a scalar operation error for the comparison x[2] < x[1]. It seems like it should be okay to me since it is inside the function that I am mapping across.
Is there a better way to implement this simple algorithm with CuArrays?
using CuArrays
CuArrays.allowscalar(false)
function evenoddsort!(x::CuArray)
N = length(x)
steps = ceil(Integer, N / 2)
odds = [view(x, j:j+1) for j in 1:2:N-1]
evens = [view(x, j:j+1) for j in 2:2:N-1]
for i = 1:steps
# odd
map(_swap2!, odds)
# even
map(_swap2!, evens)
end
return nothing
end
# this unsafe function assumes that length(x) == 2
function _swap2!(x::AbstractVector)
if x[2] < x[1]
reverse!(x)
end
return nothing
end
a=cu(collect(5:-1:1));
evenoddsort!(a)
The error I see is ERROR: scalar getindex is disallowed in the _swap2! function on the line if x[2] < x[1].
I’m using reverse! to avoid getting the same error on that line as well. If I replace it with the explicit call x[1:2] = [x[2], x[1]] I see the same error on that line.
You’re right. The array comprehension is creating a CPU array (of GPU arrays). If I replace my problematic _swap2! function with a simple reverse! it runs, but the performance of evenoddsort! is terrible because of this.
It wasn’t obvious to me how I could accomplish this using broadcasting since the algorithm is pairwise. Would you be able to elaborate? I’m still getting the ropes here. Thanks again for the help!
I think there’s some progress here, but it’s still not quite working. To use your suggestion I had to change my _swap2! function to take 2 arguments instead of 1. Before I was creating adjacent, overlapping, pairs through the array comprehension calls. The new change created two new problems:
The inputs to _swap2 are now scalars so the changes are not being propagated outside the function - view seems to drop singleton dimensions and Julia inplace functions don’t seem to work on single values.
I’m now getting same ERROR: scalar getindex is disallowed but on the map call itself. This was what led me to that unsafe single array interface to _swap2! originally.
function evenoddsort!(x::AbstractArray)
N = length(x)
steps = ceil(Integer, N / 2)
odds = view(x, 1:2:N)
evens = view(x, 2:2:N)
for i = 1:steps
# odd
map(_swap2!, odds[1:floor(Integer, N / 2)], evens) # if even, this is okay, if odd drop first odd
# even
map(_swap2!, evens[1:(steps-1)], odds[2:end])
end
return nothing
end
function _swap2!(a, b)
if a > b
a, b = b, a
end
return nothing
end
Probably you can’t use view on the GPU. You’ll have to do the “pointer arithmetic” by hand somehow.
It may be that you just need to write your own kernel for this. ˜
You could try using KernelAbstractions.jl
julia> using CuArrays, CUDAnative
julia> function kernel(a)
b = view(a, 2:3)
b[1] = 1
return
end
kernel (generic function with 1 method)
julia> a = CuArrays.zeros(Int, 4)
4-element CuArray{Int64,1,Nothing}:
0
0
0
0
julia> @cuda kernel(a)
julia> a
4-element CuArray{Int64,1,Nothing}:
0
1
0
0
A sort function that launches several kernels to swap items at every step is likely not going to perform well though. You probably need to implement a custom kernel for this.
The following code based on your suggestion does work. It’s just real slow.
While view does work on the GPU in general, it doesn’t seem to in this case. The broadcast call oddview = pairview.(Ref(x), 1:2:N-1) doesn’t seem to keep the pointer data on the GPU, which makes the code very inefficient.
pairview(v, j) = view(v, j:j+1)
function evenoddsort!(x::AbstractArray)
N = length(x)
steps = ceil(Integer, N / 2)
oddview = pairview.(Ref(x), 1:2:N-1)
evenview = pairview.(Ref(x), 2:2:N-1)
for i = 1:steps
# odd
reverse!.(oddview)
# even
reverse!.(evenview)
end
return nothing
end
That’s what I was afraid of. I was hoping there was an easy way to implement this using standard Julia machinery, but I suppose that was a bit naive - if it were easy someone else probably would have done it already.
Maybe the next time I have some free cycles, I’ll try and work through the documentation on how to optimize custom kernels. Thanks for trying to help sort through this, everyone!
@maleadt What is your preferred way for us in the community to upvote feature requests for the Julia/GPU ecosystem? My applications would really benefit from being able to call sort! or partialsort! on a CuArray.
It seems like the parallel quicksort algorithms are a much better choice than the even-odd technique I’m attempting above. Base already uses a sequential quicksort algorithms as the default. The following links may be helpful:
After looking into this for a bit, this looks like a pretty big task. I won’t pretend I have the skill or the bandwidth to get this done in the immediate future. I’ll try and keep learning in my free time, though.
Best open an issue on CuArrays.jl, those are easier to keep track of.
A basic sorting implementation is fairly easy, e.g., limited to pow2-sizes inputs and only sorting a single dimension. The complexity is there when wanting to support all inputs and the full extent of the sorting interface. Maybe we should wrap some external libraries, like Thrust, for this. But that requires some complexity in the binary building department, so isn’t ideal either.