Faster Sorting With GPU?

judsoncs · January 16, 2018, 5:55pm

I’ve written a program which needs to sort massive multi-dimensional arrays (several million rows). The problem is, as you know, QuickSort is not all that quick. I was wondering if anyone knows of a way to outsource the sort function to the GPU instead? It occurs to me that the GPU is essentially doing this nonstop with z-buffering anyways and is obviously considerably better optimized for this sort of task.

rveltz · January 16, 2018, 7:28pm

ArrayFire has one!

judsoncs · January 16, 2018, 8:19pm

Ok so it looks like AFArray is formatted a bit differently than what I’m used to. My original array was initialized like this:
Data=Array{Any}(length(rows,columns)
And called by:
Data[row#,column#]
How does the format change for AFArray?

ChrisRackauckas · January 16, 2018, 8:36pm

Don’t use arrays of Any if you want performance. In fact, seeing this, I would recommend typing this better and just trying it on the CPU again.

ChrisRackauckas · January 16, 2018, 8:36pm

Don’t use indexing with GPUs.

judsoncs · January 16, 2018, 8:48pm

Oh, ok. So the reason I used type Any was because I have columns in the array that contain bools. Would it be more efficient to just keep the floats in one array and the bools in another and use the sorting order from the float array to organize the bools?

simonbyrne · January 16, 2018, 8:50pm

How many columns do you have? It may be faster to create a vector of Tuples and sort that, or use something like DataFrames.jl or JuliaDB.jl.

xiaodai · January 16, 2018, 8:53pm

perhaps you can post a small example of your data? I am working on sorting algorithms so I may have some ideas.

Do you need to sort on multikeys keys or just one key?

judsoncs · January 16, 2018, 9:08pm

It’s a pretty simple setup, just a massive data set. I have 8 columns (plus 1 column of bools which I’ve cut off for the performance reasons mentioned) and anywhere from 1 million to 12 million rows. I’m using the sortrows function and I’m organizing the data first according to the 2nd column, then the 3rd, then the 1st.

foobar_lv2 · January 16, 2018, 10:07pm

Could you post explicit code that generates something like your data (e.g. using rand for data generation), and explicit code that you think is too slow?

Have you thought about whether Float32 is enough? Also, are you sure you don’t want to transpose your data? In Julia, accessing an entire column is fast, and accessing an entire row is slow. Hence sortrows should be avoided if possible.

However, compared to the use of Any, none of this matters.

xiaodai · January 16, 2018, 11:17pm

what are the 8 columns? are they integers, floats? and how many distinct values are in each of the columns? Can we assume anything about the distributions of numbers in these columns? Or we don’t know, can we assume that they are completely random and span the whole range?

The reason why I ask is, say one of the columns can only contains values from 1 to 1000 then there are faster algorithms than quicksort (e.g. counting sort).

judsoncs · January 17, 2018, 12:33am

They are all floats and nothing’s absolute but it would be reasonable to assume that they all fall between -100 and 100.

xiaodai · January 17, 2018, 12:41am

for floats between -100 and 100 isn’t as useful as integer. Therefore you can use float16 to store the values?

judsoncs · January 17, 2018, 12:59am

I’ll look into it.

xiaodai · January 17, 2018, 1:05am

12 million rows isn’t that big nowadays. what would be ur definition of fast? sub 1 second or sub 10 second sort?

judsoncs · January 17, 2018, 1:38am

Preferably sub 1 second-I may have to repeat the sort function hundreds of thousands of times.

ChrisRackauckas · January 17, 2018, 1:44am

Have you tested the timing with it being strictly typed yet?

xiaodai · January 17, 2018, 11:21am

Can you post your code for how you use sortrows?

foobar_lv2 · January 17, 2018, 1:30pm

After taking a short look into sort.jl, the code for sortrows and sortcols appears to simply suck (so don’t use it if you care for performance).

Consider the following quick and dirty code, which should do the same:

using StaticArrays

function Base.isless(a::SVector, b::SVector)
#may segfault on bad input; too lazy to do right now
       @inbounds for i in 1:length(a)
       a[i] < b[i] && return true
       a[i] > b[i] && return false
       end
       return false
       end

A= rand(SVector{8,Float32},10_000_000);
A2 = rand(Float32, 8, 10_000_000);

#post warm-up
@time sortcols(A2);
 26.563095 seconds (10.12 M allocations: 921.714 MiB, 3.65% gc time)
@time sort(A; alg=QuickSort);
  3.976193 seconds (13 allocations: 305.176 MiB, 1.73% gc time)
@time sort(A; alg=MergeSort);
  5.784950 seconds (15 allocations: 457.764 MiB, 0.07% gc time)

edit: On julia 0.62, might be better on 0.7.

sdanisch · January 17, 2018, 1:51pm

you could also do:

julia> A2 = rand(Float32, 8, 10_000_000)
julia> Atup = reinterpret(NTuple{8, Float32}, A2, (size(A2, 2),));

julia>sort(Atup; alg=QuickSort)# compile

julia> @time sort(Atup; alg=QuickSort);
  2.695609 seconds (13 allocations: 305.176 MiB, 1.50% gc time)

Topic		Replies	Views
GPU Sort Function GPU question , gpuarrays , sort	20	4843	April 2, 2020
`CUDA.quicksort` is not faster than CPU radixsort GPU performance	3	1269	July 4, 2021
How to sort an array based on another on GPU (CUDA) efficiently? GPU	3	378	March 17, 2024
GPU sort WIP (GPU 1000x faster than CPU? I must be doing something wrong) GPU sort	12	2198	January 31, 2019
Sort on tuple is > 4x as fast as sortrows on the equivalent array General Usage performance , sort	2	662	September 13, 2017

Faster Sorting With GPU?

Related topics