`CUDA.quicksort` is not faster than CPU radixsort

Here’s my testing code

using SortingAlgorithms, BenchmarkTools

a32 = rand(Float32, 10_000_000)

@benchmark sort($a32, alg=RadixSort)


using CUDA

cua32 = cu(a32)

@benchmark CUDA.@sync sort($cua32)

So on my GPU, an RTX 2080, CUDA sorting is not faster than radix sort on the CPU. YMMV, obviously.

100 million elements compared

Increasing the size to 100 million elements gives the same result: the GPU sort is still not faster.

GPU radix sort needed?

The issue might be that the sort implemented in CUDA.jl is quicksort, whereas a radix sort implementation on the GPU would be blazing fast. I was trying something along these lines, but I think I got stuck somewhere with an older version of CUDA.jl.
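For reference, here is a minimal CPU sketch of an LSD radix sort for Float32, just to show the structure a GPU version would parallelize (histogram, prefix sum, scatter per pass). The bit-twiddling key mapping and all function names here are my own illustration, not anything from CUDA.jl:

```julia
# Minimal LSD (least-significant-digit) radix sort sketch for Float32.
# CPU-only illustration; a GPU version would parallelize the histogram,
# prefix-sum, and scatter steps of each pass.

# Order-preserving map from Float32 bits to UInt32:
# negative floats get all bits flipped, non-negatives get the sign bit set.
float_to_key(f::Float32) = let u = reinterpret(UInt32, f)
    (u & 0x80000000) != 0 ? ~u : u | 0x80000000
end

key_to_float(k::UInt32) =
    (k & 0x80000000) != 0 ? reinterpret(Float32, k ⊻ 0x80000000) :
                            reinterpret(Float32, ~k)

function radixsort_f32(a::Vector{Float32})
    keys = float_to_key.(a)
    buf  = similar(keys)
    for shift in (0, 8, 16, 24)                 # four 8-bit passes, LSD first
        counts = zeros(Int, 256)
        for k in keys                           # histogram of the current byte
            counts[Int((k >> shift) & 0xff) + 1] += 1
        end
        offsets = cumsum(counts) .- counts      # exclusive prefix sums
        for k in keys                           # stable scatter into buckets
            d = Int((k >> shift) & 0xff) + 1
            offsets[d] += 1
            buf[offsets[d]] = k
        end
        keys, buf = buf, keys                   # ping-pong the buffers
    end
    return key_to_float.(keys)
end
```

On random Float32 data (including negatives), `radixsort_f32(a)` should agree with `sort(a)`.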

How to load Float64 onto GPU?

a64 = rand(10_000_000)

cua64 = cu(a64)

We can see that cua64 is a CuArray{Float32}. Anyway, GPUs like 32-bit floats and have much better performance with them, so I can see why cu does the implicit conversion. But it would be nice to have an option to copy the Float64 data directly, and cua64 = Float64.(cu(a64)) has obvious issues: the values have already been truncated to Float32 before the conversion back.
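As a quick illustration of why widening after the fact doesn't help (a plain-CPU check, no GPU needed): once the data has passed through Float32, the low mantissa bits are already gone.

```julia
# Float64.(cu(a64)) would widen values that cu() had already truncated to
# Float32. The same round-trip on the CPU shows the precision loss:
x = 1.0 + 2.0^-40            # exactly representable in Float64 ...
y = Float64(Float32(x))      # ... but not in Float32 (spacing near 1 is 2^-23)
x == y                       # false: the 2^-40 term was rounded away
```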

Anyhow, I tested it:

sort(a64, alg=RadixSort) == collect(sort(cua64))

So I think it’s fair to say that GPU sort, as of now, is no faster than the fastest available CPU sort. Also, the CPU tends to have access to more RAM than the GPU (I think), so the CPU can handle larger datasets anyway.

Try CuArray{Float64}(a64) instead.


I think CuArray(a64) is sufficient.


It’s still a useful operation though. If you are doing computation on the GPU that actually is faster than CPU, then it’s ideal to keep your data on the GPU and continue operating on it where it is instead of moving it to the CPU and doing the sort and then moving it back.

So if your workload is just a single sort, sure, stick with the CPU for now. But if you’re doing more than just sorting, then it can be quite useful to be able to sort your data in place on the GPU without expensive transfers between CPU and GPU.
