Hello!
I have the following toy example:
```julia
using CUDA
using BenchmarkTools

function reorder_vectors!(sorted_indices, vec1, vec2, vec3)
    CUDA.sortperm!(sorted_indices, vec1)
    vec1 .= vec1[sorted_indices]
    vec2 .= vec2[sorted_indices]
    vec3 .= vec3[sorted_indices]
end

# Initialize vectors
n = 100_000
vec1 = CUDA.rand(n)
vec2 = CUDA.rand(n)
vec3 = CUDA.rand(n)
sorted_indices = CUDA.zeros(Int, n)

# Benchmark memory allocation and execution time
mem_allocated = CUDA.@allocated reorder_vectors!(sorted_indices, vec1, vec2, vec3)
execution_time = @benchmark CUDA.@sync reorder_vectors!($sorted_indices, $vec1, $vec2, $vec3)
println("GPU Memory allocated: $mem_allocated bytes")
display(execution_time)
```
When I run it, I get a result that seems quite slow to me, and one that allocates on the GPU:
```
GPU Memory allocated: 1500063 bytes
BenchmarkTools.Trial: 2584 samples with 1 evaluation.
 Range (min … max):  1.797 ms … 37.934 ms  ┊ GC (min … max): 0.00% … 33.40%
 Time  (median):     1.848 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.928 ms ± 989.787 μs ┊ GC (mean ± σ):  0.50% ± 0.92%

 [histogram: frequency by time, 1.8 ms … 2.34 ms]

 Memory estimate: 53.36 KiB, allocs estimate: 1489.
```
Is this an anti-pattern on the GPU, i.e. should reordering several vectors by one shared set of sorted indices be avoided entirely, or is there a more efficient way to do it than what I am doing right now?
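For reference, one variant I considered is gathering through a single preallocated scratch buffer, so that each reorder is a gather plus a copy instead of allocating a fresh array for every `v[sorted_indices]`. This is only a sketch: the name `reorder_vectors_buffered!` and the small CPU example data are mine, it is written against plain `AbstractVector`s, and I am assuming the same broadcast-over-a-view pattern carries over to `CuVector`s (with `CUDA.sortperm!` in place of `sortperm!`). I have not benchmarked it on the GPU:

```julia
# Sketch: reuse one preallocated scratch buffer for all three gathers,
# instead of allocating a new array per `v[sorted_indices]`.
function reorder_vectors_buffered!(sorted_indices, scratch, vec1, vec2, vec3)
    sortperm!(sorted_indices, vec1)          # CUDA.sortperm! for CuVectors
    for v in (vec1, vec2, vec3)
        scratch .= @view v[sorted_indices]   # gather into the scratch buffer
        copyto!(v, scratch)                  # write the permuted data back
    end
    return nothing
end

# CPU usage example with made-up data:
vec1 = [3.0, 1.0, 2.0]
vec2 = [30.0, 10.0, 20.0]
vec3 = [300.0, 100.0, 200.0]
idx = zeros(Int, 3)
tmp = similar(vec1)
reorder_vectors_buffered!(idx, tmp, vec1, vec2, vec3)
# vec1 is now sorted; vec2 and vec3 follow the same permutation
```

Would something along these lines be the idiomatic fix, or is there a better primitive for permuting several arrays at once?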
Kind regards