How to sort an array based on another on GPU (CUDA) efficiently?

Hello!

I have the following toy example:

using CUDA
using BenchmarkTools

function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
    CUDA.sortperm!(sorted_indices, vec1)
    vec1 .= vec1[sorted_indices]
    vec2 .= vec2[sorted_indices]
    vec3 .= vec3[sorted_indices]
end

# Initialize vectors
n = 100_000
vec1 = CUDA.rand(n)
vec2 = CUDA.rand(n)
vec3 = CUDA.rand(n)
sorted_indices = CUDA.zeros(Int, n)

# Benchmark memory allocation and execution time
mem_allocated = CUDA.@allocated reorder_vectors!(sorted_indices, vec1, vec2, vec3)
execution_time = @benchmark CUDA.@sync reorder_vectors!($sorted_indices, $vec1, $vec2, $vec3)

println("GPU Memory allocated: $mem_allocated bytes")
display(execution_time)

When I run it, I get what seems to me a rather slow and heavily allocating result on the GPU:

GPU Memory allocated: 1500063 bytes
BenchmarkTools.Trial: 2584 samples with 1 evaluation.
 Range (min … max):  1.797 ms …  37.934 ms  ┊ GC (min … max): 0.00% … 33.40%
 Time  (median):     1.848 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.928 ms ± 989.787 μs  ┊ GC (mean ± σ):  0.50% ±  0.92%

    ▂▆█▅▁
  ▂▄█████▅▅▃▃▃▃▃▃▂▃▃▂▃▂▃▂▃▂▂▂▂▂▃▃▃▃▃▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂ ▃
  1.8 ms          Histogram: frequency by time        2.34 ms <

 Memory estimate: 53.36 KiB, allocs estimate: 1489.

Is this an anti-pattern on the GPU, i.e. should sorting arrays based on indices derived from another array be avoided entirely, or is there a more efficient way to do this than what I am doing right now?
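
For reference, a single call can also be wrapped in CUDA.@time, which prints the per-call GPU allocations alongside the timing (same setup as above):

CUDA.@time reorder_vectors!(sorted_indices, vec1, vec2, vec3)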

Kind regards

You seem to be using a pretty slow GPU:

julia> @btime CUDA.@sync reorder_vectors!(sorted_indices, vec1, vec2, vec3)
  327.501 μs (2140 allocations: 73.47 KiB)

In isolation, the CUDA.jl sort doesn't look that bad:

julia> @btime sortperm!($(rand(Int, 100_000)), $(rand(Float32, 100_000)));
  3.231 ms (2 allocations: 781.30 KiB)

julia> @btime CUDA.@sync sortperm!($(CUDA.rand(Int, 100_000)), $(CUDA.rand(Float32, 100_000)));
  214.269 μs (1550 allocations: 53.52 KiB)

However, there's something weird with your reorder_vectors! function. The broadcasts in there seem to somehow speed up CPU execution:

julia> function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
           sortperm!(sorted_indices, vec1)
           vec1 .= vec1[sorted_indices]
           return
       end
reorder_vectors! (generic function with 1 method)

julia> @btime reorder_vectors!($(rand(Int, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)));
  171.899 μs (2 allocations: 390.67 KiB)


julia> function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
           sortperm!(sorted_indices, vec1)
           return
       end
reorder_vectors! (generic function with 1 method)

julia> @btime reorder_vectors!($(rand(Int, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)), $(rand(Float32, 100_000)));
  3.432 ms (2 allocations: 781.30 KiB)

I don't have the time to investigate further, but if you can figure out an MWE where the CUDA sort performs much worse than the CPU sort (on reasonable input sizes; n=100_000 is relatively small) feel free to open an issue on the CUDA.jl bug tracker.
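
As an aside, one way to avoid the temporaries created by vec[sorted_indices] could be to gather into a preallocated buffer with a small kernel. A rough sketch (function names and launch configuration are mine, and I haven't benchmarked it):

using CUDA

# Gather src[perm] into dst, one thread per element.
function gather_kernel!(dst, src, perm)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = src[perm[i]]
    end
    return nothing
end

# buf is a preallocated CuVector with the same length and eltype as the data vectors.
function reorder_vectors_buffered!(buf, sorted_indices, vec1, vec2, vec3)
    sortperm!(sorted_indices, vec1)
    threads = 256
    blocks = cld(length(vec1), threads)
    for v in (vec1, vec2, vec3)
        @cuda threads=threads blocks=blocks gather_kernel!(buf, v, sorted_indices)
        copyto!(v, buf)  # device-to-device copy back into the original vector
    end
    return nothing
end

Note that sortperm! itself still allocates internally, so this would only remove the gather temporaries.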

Hi, yes, the GPU I am using is not the strongest, but the same goes for the CPU; both are quite a few years old. I made an example using n = 10_000_000 and ran it on both GPU and CPU. Full code here:

using CUDA
using BenchmarkTools

function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
    sortperm!(sorted_indices, vec1)
    vec1 .= vec1[sorted_indices]
    vec2 .= vec2[sorted_indices]
    vec3 .= vec3[sorted_indices]
end

### GPU TEST
# Initialize vectors
n = 10_000_000
vec1 = CUDA.rand(n)
vec2 = CUDA.rand(n)
vec3 = CUDA.rand(n)
sorted_indices = CUDA.zeros(Int, n)

# Benchmark memory allocation and execution time
mem_allocated = CUDA.@allocated reorder_vectors!(sorted_indices, vec1, vec2, vec3)
execution_time = @benchmark CUDA.@sync reorder_vectors!($sorted_indices, $vec1, $vec2, $vec3)

println("GPU Memory allocated: $mem_allocated bytes")
display(execution_time)

### CPU Test
vec1 = Array(vec1)
vec2 = Array(vec2)
vec3 = Array(vec3)
sorted_indices = zeros(Int, n)

# Benchmark memory allocation and execution time
mem_allocated  = @allocated reorder_vectors!(sorted_indices, vec1, vec2, vec3)
execution_time = @benchmark reorder_vectors!($sorted_indices, $vec1, $vec2, $vec3)

println("CPU Memory allocated: $mem_allocated bytes")
display(execution_time)
###

And my GPU is surprisingly slow:

GPU Memory allocated: 150000063 bytes
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  374.892 ms … 380.100 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     377.190 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   377.275 ms ±   1.557 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 132.50 KiB, allocs estimate: 3761.

While the CPU seems to be a “beast”:

CPU Memory allocated: 152644328 bytes
BenchmarkTools.Trial: 37 samples with 1 evaluation.
 Range (min … max):  126.612 ms … 146.965 ms  ┊ GC (min … max): 0.00% … 11.32%
 Time  (median):     140.002 ms               ┊ GC (median):    8.28%
 Time  (mean ± σ):   137.852 ms ±   7.106 ms  ┊ GC (mean ± σ):  6.42% ±  4.77%

      ▃ ▃  ▃                      ▃           ▃ ▃  █   █ ▃▃
  ▇▇▇▇█▁█▇▁█▁▁▇▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁█▁▁▁▁▇▇▇▇▁▁▇█▁█▁▁█▁▁▁█▁█▇▇▁▇ ▁
  127 ms           Histogram: frequency by time          147 ms <

 Memory estimate: 114.44 MiB, allocs estimate: 6.

I am a bit confused by these results, especially the number of allocations, so I hope I've made a simple mistake somewhere.
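
As far as I can tell, each vec[sorted_indices] materializes a fresh temporary CuArray before the broadcast copies it back into the original vector, which would explain a large part of those GPU bytes. Assuming the setup above, the size of one such temporary can be checked with:

CUDA.@allocated vec1[sorted_indices]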

I will open an issue on the repo in a few seconds.

Kind regards

@maleadt I confirm the same (interesting!) pattern as you: when only doing sortperm!, the GPU is faster.

[benchmark screenshot: sortperm! only]

But when the vec1 .= ... lines are included, the CPU is faster ???

[benchmark screenshot: with the vec .= ... lines included]

The only difference between these benchmarks is whether the vec1 .= ... lines are commented out. Both CPU and GPU are sped up by including the operation, which is quite weird.

I also put return nothing at the end of the function.
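
Concretely, the two variants I compared are roughly the following (the names are mine, otherwise this is the same function as above):

# Variant A: sortperm! plus the gather/assign broadcasts
function reorder_full!(sorted_indices, vec1, vec2, vec3)
    sortperm!(sorted_indices, vec1)
    vec1 .= vec1[sorted_indices]
    vec2 .= vec2[sorted_indices]
    vec3 .= vec3[sorted_indices]
    return nothing
end

# Variant B: the vec .= ... lines commented out (sortperm! only)
function reorder_sort_only!(sorted_indices, vec1, vec2, vec3)
    sortperm!(sorted_indices, vec1)
    return nothing
end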

Kind regards