You seem to be using a pretty slow GPU:
julia> @btime CUDA.@sync reorder_vectors!(sorted_indices, vec1, vec2, vec3)
327.501 μs (2140 allocations: 73.47 KiB)
In isolation, the CUDA.jl sort
doesn’t look that bad:
julia> @btime sortperm!($(rand(Int, 100_000)), $(rand(Float32, 100_000)));
3.231 ms (2 allocations: 781.30 KiB)
julia> @btime CUDA.@sync sortperm!($(CUDA.rand(Int, 100_000)), $(CUDA.rand(Float32, 100_000)));
214.269 μs (1550 allocations: 53.52 KiB)
However, there’s something weird with your reorder_vectors!
function. The broadcasts in there seem to somehow speed up CPU execution:
julia> function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
sortperm!(sorted_indices, vec1)
vec1 .= vec1[sorted_indices]
return
end
reorder_vectors! (generic function with 1 method)
julia> @btime reorder_vectors!($(rand(Int, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)));
171.899 μs (2 allocations: 390.67 KiB)
julia> function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
sortperm!(sorted_indices, vec1)
return
end
reorder_vectors! (generic function with 1 method)
julia> @btime reorder_vectors!($(rand(Int, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)));
3.432 ms (2 allocations: 781.30 KiB)
I don’t have the time to investigate further, but if you can figure out an MWE where the CUDA sort performs much worse than the CPU sort (on reasonable input sizes; n=100_000
is relatively small) feel free to open an issue on the CUDA.jl bug tracker.