Inplace array modification performances

dgasteau · July 11, 2022, 10:17am

Hi,

I am trying to improve the memory management of some complex iterative procedure.
Basically the procedure could be synthesized to
for ii = 1:Nt
do some partial arrays value update
Compute some stuff (include fft ffti and other)
end

I have some question about partial assignment of CuArrays and performances.
I noticed that some code, which does not perform allocations in CPU, presents allocations on GPU.
Researching on the forum has taught me that setindex! may not be the best with arguments of type view so I went for other approaches.
The speed of computation is good in all the approaches, but considering that my actual set up implies iterative process on several large matrices, the memory management becomes a limiting factor.

How could I update an array with values of another array without making allocations?

using CUDA
using BenchmarkTools

Nx = 200
Ny = 200
Nz = 500

data = rand(Nx, Ny, Nz)
Cu_p = CUDA.zeros(Float32,Nx, Ny, Nz)                # pressure
Cu_tr_data = CuArray(data)               # pressure

t_index = 1

pos_x = 1 : Nx
pos_y = 1 : Ny
pos_z = 1


function func1(Cu_p, pos_x, pos_y, pos_z, Cu_tr_data, t_index)
    Cu_p[pos_x, pos_y, pos_z] .= view(Cu_tr_data, :, :, t_index)
end

function func2(Cu_p, pos_x, pos_y, pos_z, Cu_tr_data, t_index)
    copyto!(Cu_p[pos_x, pos_y, pos_z], view(Cu_tr_data, :, :, t_index))
end

source = view(Cu_p, pos_x, pos_y, pos_z)

function func3(source, Cu_tr_data, t_index)
    source .= view(Cu_tr_data, :, :, t_index)
end
function func4(source, Cu_tr_data, t_index)
    copyto!(source, view(Cu_tr_data, :, :, t_index))
end

@benchmark CUDA.@sync func1($Cu_p, $pos_x, $pos_y, $pos_z, $Cu_tr_data, $t_index)
@benchmark CUDA.@sync func2($Cu_p, $pos_x, $pos_y, $pos_z, $Cu_tr_data, $t_index)
@benchmark CUDA.@sync func3($source, $Cu_tr_data, $t_index)
@benchmark CUDA.@sync func4($source, $Cu_tr_data, $t_index)

maleadt · July 11, 2022, 1:30pm

Launching a kernel requires a allocations.

The allocations you report here are tiny, esp. compared to the total execution times. Are the allocations a problem, or why are you trying to avoid them?

dgasteau · July 11, 2022, 1:46pm

To make it short one iteration of my process, with the smaller array size I work with, takes about 0.07 sec per iteration, but I get a peak in execution time every 3 iteration step or so when gc is called. Ideally I would work with arrays 4 to 8 times larger.
Given there is several hundred of iteration steps, I feel there is somewhere to gain time. I investigated towards minimizing overall allocations to better handle my memory and reduce gc calls.
Every computation step is made inplace, all CuArrays are created at the beginning of the program, though I still have allocations.

maleadt · July 12, 2022, 6:35am

You are looking at CPU allocations. Use CUDA.@time to report GPU allocations.

dgasteau · July 20, 2022, 7:09am

Sorry for the late response, I could not have access to my GPU earlier.
Indeed, I misunderstood the output of the @benchmark. Switching to CUDA.@time shows indeed no GPU allocations (except the copyto!(xx, view()) form, but I saw an other post on the forum talking about that).
Thanks again.

Topic		Replies	Views
Calculate FFT on GPU for every row of a 2D array Performance gpu	2	1043	August 14, 2018
Dot-product of CuArray views is slow GPU performance , memory-allocation , views	10	1569	May 11, 2021
How do I get allocations down but keep code speed? GPU	9	358	January 28, 2023
What is the optimal way of updating CuArray? GPU cudanative	7	1543	July 5, 2018
Performance of view with cuArrays GPU	11	2707	November 11, 2018

Inplace array modification performances

Related topics