How to copy view of CuArray to Array efficiently?

0samuraiE · October 5, 2024, 5:26pm

Hi all. I ran into an unexpected GPU allocation when copying a view of a CuArray to the host. I tried three ways of copying: the second one allocates GPU memory, and the third one is much slower.
Should I avoid copying a CuArray view to the host?

  0.001308 seconds (6 CPU allocations: 96 bytes)

  0.001590 seconds (46 CPU allocations: 1.344 KiB) (1 GPU allocation: 30.518 MiB, 0.56% memmgmt time)

  0.181835 seconds (200.00 k CPU allocations: 3.052 MiB)

using CUDA

# 1. Full (contiguous) CuArray -> pinned host array
A1 = CUDA.rand(202, 202, 202);
B1 = CUDA.pin(zeros(eltype(A1), size(A1)));
copyto!(B1, A1)
B1 == Array(A1)

# 2. Non-contiguous view of a CuArray -> pinned host array
ind = CartesianIndices((200, 200, 200))
A2 = @view CUDA.rand(202, 202, 202)[ind];
B2 = CUDA.pin(zeros(eltype(A2), size(A2)));
copyto!(B2, A2)
B2 == Array(A2)

# 3. Copy an index range directly from the full CuArray
ind = CartesianIndices((200, 200, 200))
A3 = CUDA.rand(202, 202, 202);
B3 = CUDA.pin(zeros(eltype(A3), size(ind)));
copyto!(B3, ind, A3, ind)
B3[ind] == Array(A3[ind])

# Timings for the three methods, in order
CUDA.@time CUDA.@sync copyto!(B1, A1);
CUDA.@time CUDA.@sync copyto!(B2, A2);
CUDA.@time CUDA.@sync copyto!(B3, ind, A3, ind);

Ahmed_Salih · October 6, 2024, 8:22am

In general, when coding for the GPU, you should not copy arrays to the host (CPU) until you actually need to. The longer you keep the data on the GPU, the faster your overall code will be.

Transferring data between the GPU and CPU all the time will slow down your application, especially if the tasks are small.

If your GPU computation takes 1000 s and the copy takes 0.18 s, the copy will not be a bottleneck.

Kind regards

0samuraiE · October 6, 2024, 11:00am

Thank you. But my concern is the GPU allocation, not the time. The GPU allocation in the 2nd example (30.518 MiB) is the same size as the data being copied.
That seems like a waste of valuable GPU memory.
The 3rd is the best after all.

maleadt · October 6, 2024, 11:01am

If your view is non-contiguous, it will be squashed into a contiguous CuArray first before copying to the host Array using an API call. Alternatives are possible, such as performing multiple API calls to copy each contiguous slice, or by using a CuArray representing the host array (e.g., using unsafe_wrap(CuArray, ::Array)) and performing a broadcast assignment. None of these are guaranteed to improve performance in all cases though, so we default to the simplest solution, which is to allocate a temporary CuArray.
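
For illustration, the two alternatives described above might look roughly like the sketch below. It is not a definitive implementation: it assumes a CUDA.jl version in which `unsafe_wrap(CuArray, ::Array)` is available and in which the offset-based `copyto!(::Array, doffs, ::CuArray, soffs, n)` method exists, and the names `A`, `Av`, `Ba`, `Bb`, and `Bd` are made up for the example. Whether either variant beats the default temporary CuArray depends on the view shape and the hardware.

```julia
using CUDA

A   = CUDA.rand(Float32, 202, 202, 202)
ind = CartesianIndices((200, 200, 200))
Av  = @view A[ind]                      # non-contiguous view

# Alternative (a): wrap the host array as a CuArray and broadcast into it.
# Assumption: unsafe_wrap(CuArray, ::Array) is supported by the installed
# CUDA.jl version and by the device.
Ba = zeros(Float32, size(Av))
Bd = unsafe_wrap(CuArray, Ba)           # device-visible alias of Ba
Bd .= Av                                # a kernel writes straight into host memory
CUDA.synchronize()                      # wait before reading Ba on the CPU

# Alternative (b): one API call per contiguous slice.
# Each column Av[:, j, k] is contiguous in A, so copy 200 elements at a time
# using linear offsets into the source and the destination.
Bb = zeros(Float32, size(Av))
LA = LinearIndices(A)
LB = LinearIndices(Bb)
for k in 1:200, j in 1:200
    copyto!(Bb, LB[1, j, k], A, LA[1, j, k], 200)
end

# Both should match the default path, which allocates a temporary CuArray.
Ba == Array(Av)
Bb == Array(Av)
```

Variant (b) issues 40,000 small copies for this view, so it trades GPU memory for API-call overhead.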

0samuraiE · October 6, 2024, 11:08am

Thanks, everyone, for your help. I will go with the third approach.
