Hello!
I am getting quite confused. I have a vector of CartesianIndices, and calling sortperm! on it works great. However, when I track performance with CUDA.@time, it reports a lot of CPU allocations. How come? Isn't the computation being performed on the GPU if the inputs to sortperm! are CuArrays?
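For reference, here is a minimal sketch of the kind of measurement I mean. The data is made up for illustration (3000 random 3D indices), and the names inds and perm as well as the Int element type of the permutation buffer are assumptions, not taken from my actual code:

```julia
using CUDA

# Hypothetical input: 3000 random CartesianIndex{3} values, uploaded to the GPU.
inds = CuArray([CartesianIndex(rand(1:100), rand(1:100), rand(1:100)) for _ in 1:3000])

# Preallocated permutation buffer on the GPU (Int element type is an assumption).
perm = CUDA.zeros(Int, length(inds))

# Warm-up call so that compilation is not included in the measurement.
sortperm!(perm, inds)

# CUDA.@time reports CPU and GPU allocations separately.
CUDA.@time sortperm!(perm, inds)
```

The warm-up call matters because the first invocation also compiles the kernels, which would otherwise inflate both the time and the CPU allocation count.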
Kind regards
It does seem to be: see CUDA.jl/src/sorting.jl at 9306cea23b813c3331df6749595c8b46d6c8b27c in JuliaGPU/CUDA.jl on GitHub.
A brief look at bitonic_sort! seemed to suggest it might be doing a fair amount of work on the CPU as well as on the GPU.
Thanks for sharing!
Yes, that is what perplexed me too, but from the code I can see what you describe: a lot of allocations on the CPU. I was surprised that it would be 10k+ allocations, though. The speed is OK at about 160 μs for 3000 elements, considering that the GPU does not get a chance to scale properly on such a small vector.
At least I can now confirm something is happening on the GPU, thanks!