`CUDA.quicksort` is not faster than CPU radixsort

Here’s my testing code

using SortingAlgorithms, BenchmarkTools

a32 = rand(Float32, 10_000_000)

@benchmark sort($a32, alg=RadixSort)


using CUDA

cua32 = cu(a32)

@benchmark CUDA.@sync sort($cua32)

So on my GPU, an RTX 2080, CUDA sorting is not faster than radix sort on the CPU. YMMV, obviously.

100 million elements compared

Increasing the size to 100 million elements gives the same result: the GPU sort is still not faster.

GPU radix sort needed?

The issue might be that the sort implemented in CUDA.jl is quicksort, whereas a radix sort implementation on the GPU would be blazing fast. I was trying something along these lines, but I think I got stuck somewhere with an older version of CUDA.jl.
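For reference, here is a minimal CPU sketch of an LSD radix sort for Float32, just to show the structure a GPU version would parallelize (histogram, prefix sum, scatter per pass). The bit-twiddling key mapping and all function names here are my own illustration, not anything from CUDA.jl:

```julia
# Minimal LSD (least-significant-digit) radix sort sketch for Float32.
# CPU-only illustration; a GPU version would parallelize the histogram,
# prefix-sum, and scatter steps of each pass.

# Order-preserving map from Float32 bits to UInt32:
# negative floats get all bits flipped, non-negatives get the sign bit set.
float_to_key(f::Float32) = let u = reinterpret(UInt32, f)
    (u & 0x80000000) != 0 ? ~u : u | 0x80000000
end

key_to_float(k::UInt32) =
    (k & 0x80000000) != 0 ? reinterpret(Float32, k ⊻ 0x80000000) :
                            reinterpret(Float32, ~k)

function radixsort_f32(a::Vector{Float32})
    keys = float_to_key.(a)
    buf  = similar(keys)
    for shift in (0, 8, 16, 24)                 # four 8-bit passes, LSD first
        counts = zeros(Int, 256)
        for k in keys                           # histogram of the current byte
            counts[Int((k >> shift) & 0xff) + 1] += 1
        end
        offsets = cumsum(counts) .- counts      # exclusive prefix sums
        for k in keys                           # stable scatter into buckets
            d = Int((k >> shift) & 0xff) + 1
            offsets[d] += 1
            buf[offsets[d]] = k
        end
        keys, buf = buf, keys                   # ping-pong the buffers
    end
    return key_to_float.(keys)
end
```

On random Float32 data (including negatives), `radixsort_f32(a)` should agree with `sort(a)`.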

How to load Float64 onto GPU?

a64 = rand(10_000_000)

cua64 = cu(a64)

We can see that cua64 is a CuArray{Float32}. Anyway, GPUs like 32-bit floats and have much better performance with them, so I can see why cu does the implicit conversion. But it would be nice to have an option to copy the Float64 data directly, and cua64 = Float64.(cu(a64)) has obvious issues: the values have already been truncated to Float32 before the conversion back.
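As a quick illustration of why widening after the fact doesn't help (a plain-CPU check, no GPU needed): once the data has passed through Float32, the low mantissa bits are already gone.

```julia
# Float64.(cu(a64)) would widen values that cu() had already truncated to
# Float32. The same round-trip on the CPU shows the precision loss:
x = 1.0 + 2.0^-40            # exactly representable in Float64 ...
y = Float64(Float32(x))      # ... but not in Float32 (spacing near 1 is 2^-23)
x == y                       # false: the 2^-40 term was rounded away
```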

Anyhow, I tested it:

sort(a64, alg=RadixSort) == collect(sort(cua64))

So I think it’s fair to say that GPU sort, as of now, is no faster than the fastest available CPU sort. Also, the CPU tends to have access to more RAM than the GPU (I think), so the CPU can handle larger datasets anyway.

Try CuArray{Float64}(a64) instead.


I think CuArray(a64) is sufficient.


It’s still a useful operation though. If you are doing computation on the GPU that actually is faster than CPU, then it’s ideal to keep your data on the GPU and continue operating on it where it is instead of moving it to the CPU and doing the sort and then moving it back.

So if your workload is just a single sort, sure, stick with the CPU for now. But if you’re doing more than just sorting, then it can be quite useful to be able to sort your data in place on the GPU without expensive transfers between CPU and GPU.
