How to copy a view of a CuArray to an Array efficiently?

Hi all. I ran into an unexpected GPU allocation when copying a view of a CuArray to the host. I tried three ways of copying: the second one allocates GPU memory, and the third one is much slower.
Should I avoid copying a CuArray view to the host?

Timings of the three CUDA.@time calls at the end of the code, in order:

  0.001308 seconds (6 CPU allocations: 96 bytes)

  0.001590 seconds (46 CPU allocations: 1.344 KiB) (1 GPU allocation: 30.518 MiB, 0.56% memmgmt time)

  0.181835 seconds (200.00 k CPU allocations: 3.052 MiB)

using CUDA

# 1) Dense CuArray -> pinned host Array
A1 = CUDA.rand(202, 202, 202);
B1 = CUDA.pin(zeros(eltype(A1), size(A1)));
copyto!(B1, A1)
B1 == Array(A1)  # sanity check

# 2) Non-contiguous view of a CuArray -> pinned host Array
ind = CartesianIndices((200, 200, 200))
A2 = @view CUDA.rand(202, 202, 202)[ind];
B2 = CUDA.pin(zeros(eltype(A2), size(A2)));
copyto!(B2, A2)
B2 == Array(A2)  # sanity check

# 3) Dense CuArray -> pinned host Array, restricted to the index range `ind`
ind = CartesianIndices((200, 200, 200))
A3 = CUDA.rand(202, 202, 202);
B3 = CUDA.pin(zeros(eltype(A3), size(ind)));
copyto!(B3, ind, A3, ind)
B3[ind] == Array(A3[ind])  # sanity check

# Time each variant; CUDA.@sync waits for the GPU work to finish
CUDA.@time CUDA.@sync copyto!(B1, A1);
CUDA.@time CUDA.@sync copyto!(B2, A2);
CUDA.@time CUDA.@sync copyto!(B3, ind, A3, ind);

In general, when coding for the GPU, you should not copy arrays to the host (CPU) until you actually need to. The longer you can keep the data on the GPU, the faster your overall code will be.

Constantly transferring data between the GPU and the CPU will slow down your application, especially if the individual tasks are small.

If your GPU calculation takes 1000 s and you then need 0.18 s to copy the result over, the copy will not be a bottleneck.

Kind regards


Thank you. But my concern is the GPU allocation, not the time. The GPU allocation in the 2nd example is the same size as the CuArray view being copied.
It seems like a waste of valuable GPU memory.

So the 3rd is the best after all.
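
As a quick check of that size, here is a small sketch of the arithmetic. It assumes the element type is Float32, which is what CUDA.rand returns by default; the variable names are only for illustration.

```julia
# Assumption: Float32 elements (the default eltype of CUDA.rand).
elsize = sizeof(Float32)          # 4 bytes per element

view_bytes = 200^3 * elsize       # the 200x200x200 view that gets copied
full_bytes = 202^3 * elsize       # the full 202x202x202 parent array

view_mib = view_bytes / 2^20      # convert bytes to MiB
full_mib = full_bytes / 2^20

println(round(view_mib; digits = 3), " MiB for the view")   # 30.518 MiB
println(round(full_mib; digits = 3), " MiB for the parent")
```

The first figure matches the 30.518 MiB reported by CUDA.@time, so the temporary appears to be the size of the view rather than of the parent array.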

If your view is non-contiguous, it will be squashed into a contiguous CuArray first before copying to the host Array using an API call. Alternatives are possible, such as performing multiple API calls to copy each contiguous slice, or by using a CuArray representing the host array (e.g., using unsafe_wrap(CuArray, ::Array)) and performing a broadcast assignment. None of these are guaranteed to improve performance in all cases though, so we default to the simplest solution, which is to allocate a temporary CuArray.
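
To make the "one copy per contiguous slice" alternative concrete, here is a minimal sketch of the index bookkeeping. The helper name copy_box_by_columns! is hypothetical and not part of CUDA.jl. It relies only on the generic five-argument copyto!(dest, doffs, src, soffs, n) with linear offsets, and is demonstrated on plain Arrays so that it runs without a GPU. The assumption is that passing a dense CuArray as src would dispatch to CUDA.jl's device-to-host method with the same signature, which I have not benchmarked.

```julia
# Hypothetical helper: copy the leading size(dst) box of `src` into the dense
# array `dst`, one contiguous column (run along the first dimension) at a time.
function copy_box_by_columns!(dst::AbstractArray{T,3}, src::AbstractArray{T,3}) where {T}
    n1, n2, n3 = size(dst)
    s1, s2, s3 = size(src)
    @assert n1 <= s1 && n2 <= s2 && n3 <= s3 "dst must fit inside src"
    for k in 1:n3, j in 1:n2
        # Linear (column-major) offset of element (1, j, k) in each array
        doffs = 1 + (j - 1) * n1 + (k - 1) * n1 * n2
        soffs = 1 + (j - 1) * s1 + (k - 1) * s1 * s2
        # One contiguous copy of n1 elements; with a CuArray `src` this is
        # assumed to become one device-to-host API call per column
        copyto!(dst, doffs, src, soffs, n1)
    end
    return dst
end

# CPU-only demonstration with small stand-in arrays
src = reshape(collect(Float32, 1:6^3), 6, 6, 6)
dst = zeros(Float32, 4, 4, 4)
copy_box_by_columns!(dst, src)
```

For the 200×200×200 view in the question, this would mean 200 * 200 = 40,000 separate copies of 200 elements each. It avoids the temporary CuArray, but the per-call overhead may well make it slower than the default, which is the trade-off described above.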

Thanks, everyone, for your input. I will go with the third approach.