Hello!
I have the following toy example:
```julia
using CUDA
using BenchmarkTools

function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
    CUDA.sortperm!(sorted_indices, vec1)
    vec1 .= vec1[sorted_indices]
    vec2 .= vec2[sorted_indices]
    vec3 .= vec3[sorted_indices]
end

# Initialize vectors
n = 100_000
vec1 = CUDA.rand(n)
vec2 = CUDA.rand(n)
vec3 = CUDA.rand(n)
sorted_indices = CUDA.zeros(Int, n)

# Benchmark memory allocation and execution time
mem_allocated = CUDA.@allocated reorder_vectors!(sorted_indices, vec1, vec2, vec3)
execution_time = @benchmark CUDA.@sync reorder_vectors!($sorted_indices, $vec1, $vec2, $vec3)
println("GPU Memory allocated: $mem_allocated bytes")
display(execution_time)
```
When I run it, I get a result that seems quite slow to me, and one that allocates on the GPU:
```
GPU Memory allocated: 1500063 bytes
BenchmarkTools.Trial: 2584 samples with 1 evaluation.
 Range (min … max):  1.797 ms … 37.934 ms  ┊ GC (min … max): 0.00% … 33.40%
 Time  (median):     1.848 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.928 ms ± 989.787 μs ┊ GC (mean ± σ):  0.50% ± 0.92%

 [histogram: frequency by time, 1.8 ms … 2.34 ms]

 Memory estimate: 53.36 KiB, allocs estimate: 1489.
```
Is this an anti-pattern on the GPU, i.e. should reordering several vectors by one shared set of sorted indices be avoided entirely, or is there a more efficient way to do it than what I am doing right now?
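For reference, one variant I considered is gathering through a single preallocated scratch buffer, so that each reorder is a gather plus a copy instead of allocating a fresh array for every `v[sorted_indices]`. This is only a sketch: the name `reorder_vectors_buffered!` and the small CPU example data are mine, it is written against plain `AbstractVector`s, and I am assuming the same broadcast-over-a-view pattern carries over to `CuVector`s (with `CUDA.sortperm!` in place of `sortperm!`). I have not benchmarked it on the GPU:

```julia
# Sketch: reuse one preallocated scratch buffer for all three gathers,
# instead of allocating a new array per `v[sorted_indices]`.
function reorder_vectors_buffered!(sorted_indices, scratch, vec1, vec2, vec3)
    sortperm!(sorted_indices, vec1)          # CUDA.sortperm! for CuVectors
    for v in (vec1, vec2, vec3)
        scratch .= @view v[sorted_indices]   # gather into the scratch buffer
        copyto!(v, scratch)                  # write the permuted data back
    end
    return nothing
end

# CPU usage example with made-up data:
vec1 = [3.0, 1.0, 2.0]
vec2 = [30.0, 10.0, 20.0]
vec3 = [300.0, 100.0, 200.0]
idx = zeros(Int, 3)
tmp = similar(vec1)
reorder_vectors_buffered!(idx, tmp, vec1, vec2, vec3)
# vec1 is now sorted; vec2 and vec3 follow the same permutation
```

Would something along these lines be the idiomatic fix, or is there a better primitive for permuting several arrays at once?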
Kind regards