Here’s my testing code
using SortingAlgorithms
a32 = rand(Float32, 10_000_000)
@benchmark sort($a32, alg=RadixSort)
using CUDA
cua32 = cu(a32)
@benchmark CUDA.@sync sort($cua32)
So the on my GPU which is RTX2080. CUDA sorting is not faster than radixsort. YMMV, obviously.
100million elements compared
So increasing the size of 100 million elements, and as you
GPU radix sort needed?
The issue might be that the sort as implemented in CUDA is quicksort whereas a radix sort implementation on GPU would be blazing fast. I was trying something along these lines but I think I got stuck somewhere with the older CUDA.
How to load Float64 onto GPU?
a64 = rand(10_000_000)
cua64 = cu(a64)
We can see that cua6
is CuArray{Float32}
. Anyway, GPUs like 32bit floats and have much better performance, so I can see why cu
does the implicit conversion. But it would be nice to have an option to copy the Float64
direct and cua64 = Float64.(cu(a64)
has obvious issues.
Anyhow, I tested it,
So I think it’s fair to say GPU sort as of now is no faster than he fastest available CPU sort. Also CPU sort tends to have access to more RAM vs GPU (I think), so CPU can handle more dataset anyway.
sort(a64, alg=RadixSort) == collect(sort(cua64))