Faster Sorting With GPU?

I’ve written a program which needs to sort massive multi-dimensional arrays (several million rows). The problem is, as you know, QuickSort is not all that quick. I was wondering if anyone knows of a way to outsource the sort function to the GPU instead? It occurs to me that the GPU is essentially doing this nonstop with z-buffering anyways and is obviously considerably better optimized for this sort of task.

1 Like

ArrayFire has one!

Ok so it looks like AFArray is formatted a bit differently than what I’m used to. My original array was initialized like this:
Data=Array{Any}(length(rows,columns)
And called by:
Data[row#,column#]
How does the format change for AFArray?

Don’t use arrays of Any if you want performance. In fact, seeing this, I would recommend typing this better and just trying it on the CPU again.

1 Like

Don’t use indexing with GPUs.

Oh, ok. So the reason I used type Any was because I have columns in the array that contain bools. Would it be more efficient to just keep the floats in one array and the bools in another and use the sorting order from the float array to organize the bools?

How many columns do you have? It may be faster to create a vector of Tuples and sort that, or use something like DataFrames.jl or JuliaDB.jl.

perhaps you can post a small example of your data? I am working on sorting algorithms so I may have some ideas.

Do you need to sort on multikeys keys or just one key?

It’s a pretty simple setup, just a massive data set. I have 8 columns (plus 1 column of bools which I’ve cut off for the performance reasons mentioned) and anywhere from 1 million to 12 million rows. I’m using the sortrows function and I’m organizing the data first according to the 2nd column, then the 3rd, then the 1st.

Could you post explicit code that generates something like your data (e.g. using rand for data generation), and explicit code that you think is too slow?

Have you thought about whether Float32 is enough? Also, are you sure you don’t want to transpose your data? In Julia, accessing an entire column is fast, and accessing an entire row is slow. Hence sortrows should be avoided if possible.

However, compared to the use of Any, none of this matters.

what are the 8 columns? are they integers, floats? and how many distinct values are in each of the columns? Can we assume anything about the distributions of numbers in these columns? Or we don’t know, can we assume that they are completely random and span the whole range?

The reason why I ask is, say one of the columns can only contains values from 1 to 1000 then there are faster algorithms than quicksort (e.g. counting sort).

They are all floats and nothing’s absolute but it would be reasonable to assume that they all fall between -100 and 100.

for floats between -100 and 100 isn’t as useful as integer. Therefore you can use float16 to store the values?

I’ll look into it.

12 million rows isn’t that big nowadays. what would be ur definition of fast? sub 1 second or sub 10 second sort?

Preferably sub 1 second-I may have to repeat the sort function hundreds of thousands of times.

Have you tested the timing with it being strictly typed yet?

2 Likes

Can you post your code for how you use sortrows?

After taking a short look into sort.jl, the code for sortrows and sortcols appears to simply suck (so don’t use it if you care for performance).

Consider the following quick and dirty code, which should do the same:

using StaticArrays

function Base.isless(a::SVector, b::SVector)
#may segfault on bad input; too lazy to do right now
       @inbounds for i in 1:length(a)
       a[i] < b[i] && return true
       a[i] > b[i] && return false
       end
       return false
       end

A= rand(SVector{8,Float32},10_000_000);
A2 = rand(Float32, 8, 10_000_000);

#post warm-up
@time sortcols(A2);
 26.563095 seconds (10.12 M allocations: 921.714 MiB, 3.65% gc time)
@time sort(A; alg=QuickSort);
  3.976193 seconds (13 allocations: 305.176 MiB, 1.73% gc time)
@time sort(A; alg=MergeSort);
  5.784950 seconds (15 allocations: 457.764 MiB, 0.07% gc time)

edit: On julia 0.62, might be better on 0.7.

you could also do:

julia> A2 = rand(Float32, 8, 10_000_000)
julia> Atup = reinterpret(NTuple{8, Float32}, A2, (size(A2, 2),));

julia>sort(Atup; alg=QuickSort)# compile

julia> @time sort(Atup; alg=QuickSort);
  2.695609 seconds (13 allocations: 305.176 MiB, 1.50% gc time)
1 Like