In-place array modification performance


I am trying to improve the memory management of a complex iterative procedure.
Basically, the procedure can be summarized as:
for ii = 1:Nt
    # update some array values in place
    # compute some stuff (includes fft, ifft and others)
end

I have some questions about partial assignment of CuArrays and performance.
I noticed that some code that does not allocate on the CPU does show allocations on the GPU.
Searching the forum taught me that setindex! may not be the best choice with arguments of type view, so I tried other approaches.
The computation speed is good with all the approaches, but since my actual setup involves an iterative process on several large matrices, memory management becomes a limiting factor.

How could I update an array with values of another array without making allocations?

using CUDA
using BenchmarkTools

Nx = 200
Ny = 200
Nz = 500

data = rand(Nx, Ny, Nz)                              # host data (Float64 by default)
Cu_p = CUDA.zeros(Float32, Nx, Ny, Nz)               # pressure
Cu_tr_data = CuArray(data)                           # data copied to the GPU

t_index = 1

pos_x = 1 : Nx
pos_y = 1 : Ny
pos_z = 1

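# Broadcast assignment into an indexed region (lowers to a broadcast into a view of Cu_p)
function func1(Cu_p, pos_x, pos_y, pos_z, Cu_tr_data, t_index)
    Cu_p[pos_x, pos_y, pos_z] .= view(Cu_tr_data, :, :, t_index)
end

# Note: Cu_p[pos_x, pos_y, pos_z] here is a getindex call that returns a copy,
# so copyto! writes into a temporary array rather than into Cu_p
function func2(Cu_p, pos_x, pos_y, pos_z, Cu_tr_data, t_index)
    copyto!(Cu_p[pos_x, pos_y, pos_z], view(Cu_tr_data, :, :, t_index))
end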

source = view(Cu_p, pos_x, pos_y, pos_z)

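# Broadcast assignment into a view created once, outside the function
function func3(source, Cu_tr_data, t_index)
    source .= view(Cu_tr_data, :, :, t_index)
end

# copyto! into the same pre-built view
function func4(source, Cu_tr_data, t_index)
    copyto!(source, view(Cu_tr_data, :, :, t_index))
end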

@benchmark CUDA.@sync func1($Cu_p, $pos_x, $pos_y, $pos_z, $Cu_tr_data, $t_index)
@benchmark CUDA.@sync func2($Cu_p, $pos_x, $pos_y, $pos_z, $Cu_tr_data, $t_index)
@benchmark CUDA.@sync func3($source, $Cu_tr_data, $t_index)
@benchmark CUDA.@sync func4($source, $Cu_tr_data, $t_index)

Launching a kernel requires some allocations.

The allocations you report here are tiny, esp. compared to the total execution times. Are the allocations a problem, or why are you trying to avoid them?
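
To check what kind of allocations those are, a minimal sketch along these lines may help. It assumes a working CUDA.jl setup; the helper name `copy_slice!` and the 64×64 sizes are hypothetical and chosen only for illustration. Base's `@allocated` counts bytes allocated on the Julia (CPU) heap, which is the same kind of allocation `@benchmark` reports.

```julia
using CUDA

# Hypothetical helper: one in-place copy via broadcast, similar to func3
copy_slice!(dest, src) = (dest .= src; nothing)

dest = CUDA.zeros(Float32, 64, 64)   # illustrative sizes, not those of the post
src  = CUDA.rand(Float32, 64, 64)

copy_slice!(dest, src)                           # warm-up so compilation is not measured
host_bytes = @allocated copy_slice!(dest, src)   # Base macro: CPU heap bytes only
println("host bytes per call: ", host_bytes)
```

Whatever number this prints refers to host memory only; it says nothing about memory allocated on the device.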


To make it short, one iteration of my process, at the smaller array size I work with, takes about 0.07 s, but I get a peak in execution time every 3 iterations or so when the GC is called. Ideally I would work with arrays 4 to 8 times larger.
Given that there are several hundred iteration steps, I feel there is time to be gained somewhere. I looked into minimizing overall allocations to handle memory better and reduce GC calls.
Every computation step is done in place and all CuArrays are created at the beginning of the program, yet I still see allocations.

You are looking at CPU allocations. Use CUDA.@time to report GPU allocations.
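
As a rough sketch of that check (assuming a working CUDA.jl setup; the array sizes and the names `dest` and `src` are illustrative, not taken from the actual program), `CUDA.@time` can wrap the in-place update directly. Its output lists CPU allocations and GPU allocations separately.

```julia
using CUDA

# Illustrative sizes, smaller than those in the original post
Nx, Ny, Nz = 64, 64, 16
Cu_p       = CUDA.zeros(Float32, Nx, Ny, Nz)
Cu_tr_data = CUDA.rand(Float32, Nx, Ny, Nz)

dest = view(Cu_p, :, :, 1)         # destination slice, created once
src  = view(Cu_tr_data, :, :, 1)   # source slice

dest .= src                        # warm-up so compilation is not measured

# Prints elapsed time plus separate CPU and GPU allocation figures
CUDA.@time dest .= src
```

If the GPU figure stays at zero for the update step, the allocations seen with `@benchmark` come from the host side.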


Sorry for the late response, I could not access my GPU earlier.
Indeed, I misunderstood the output of @benchmark. Switching to CUDA.@time shows no GPU allocations (except for the copyto!(xx, view()) form, but I saw another post on the forum about that).
Thanks again.