How to sort an array based on another on GPU (CUDA) efficiently?

You seem to be using a pretty slow GPU:

julia> @btime CUDA.@sync reorder_vectors!(sorted_indices, vec1, vec2, vec3)
  327.501 μs (2140 allocations: 73.47 KiB)

In isolation, the CUDA.jl sort doesn’t look that bad:

julia> @btime sortperm!($(rand(Int, 100_000)), $(rand(Float32, 100_000)));
  3.231 ms (2 allocations: 781.30 KiB)

julia> @btime CUDA.@sync sortperm!($(CUDA.rand(Int, 100_000)), $(CUDA.rand(Float32, 100_000)));
  214.269 μs (1550 allocations: 53.52 KiB)

However, there’s something weird with your reorder_vectors! function. The broadcasts in there seem to somehow speed up CPU execution:

julia> function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
           sortperm!(sorted_indices, vec1)
           vec1 .= vec1[sorted_indices]
           return
       end
reorder_vectors! (generic function with 1 method)

julia> @btime reorder_vectors!($(rand(Int, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)));
  171.899 μs (2 allocations: 390.67 KiB)


julia> function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
           sortperm!(sorted_indices, vec1)
           return
       end
reorder_vectors! (generic function with 1 method)

julia> @btime reorder_vectors!($(rand(Int, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)));
  3.432 ms (2 allocations: 781.30 KiB)

I don’t have the time to investigate further, but if you can figure out an MWE where the CUDA sort performs much worse than the CPU sort (on reasonable input sizes; n=100_000 is relatively small) feel free to open an issue on the CUDA.jl bug tracker.

1 Like