I am trying to improve the memory management of a complex iterative procedure.
Basically, the procedure can be summarized as:
for ii = 1:Nt
    # update part of the values of some arrays
    # compute some stuff (including fft, ifft and others)
end
I have a question about partial assignment to CuArrays and its performance.
I noticed that some code which does not allocate on the CPU does allocate on the GPU.
Searching the forum taught me that setindex! may not be the best choice when its arguments are views, so I tried other approaches.
The computation speed is good with all of the approaches, but since my actual setup involves an iterative process on several large matrices, memory management becomes a limiting factor.
How can I update an array with the values of another array without allocating?
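For comparison, here is a minimal CPU-only sketch of the kind of update I mean, assuming plain `Array`s rather than CuArrays. The names `dst`, `src` and `update!` are hypothetical and only serve this illustration; they are not part of my actual code.

```julia
# CPU-only sketch; `dst`, `src` and `update!` are hypothetical names for illustration.
dst = zeros(Float32, 20, 20, 50)
src = rand(Float32, 20, 20, 50)

# In-place update of one z-slice of `dst` from one slice of `src`.
# Both sides are views, so no temporary copy of the slice is created.
function update!(dst, src, pos_z, t_index)
    view(dst, :, :, pos_z) .= view(src, :, :, t_index)
    return dst
end

update!(dst, src, 1, 1)                     # warm-up call (triggers compilation)
bytes = @allocated update!(dst, src, 1, 1)  # bytes allocated by the second call
println("bytes allocated: ", bytes)
```

This is the behaviour I would like to reproduce on the GPU: the destination slice is overwritten in place and the rest of the array is left untouched.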
using CUDA
using BenchmarkTools
Nx = 200
Ny = 200
Nz = 500
data = rand(Nx, Ny, Nz)
Cu_p = CUDA.zeros(Float32, Nx, Ny, Nz) # pressure
Cu_tr_data = CuArray(data) # data to copy from (Float64, since rand defaults to Float64)
t_index = 1
pos_x = 1 : Nx
pos_y = 1 : Ny
pos_z = 1
function func1(Cu_p, pos_x, pos_y, pos_z, Cu_tr_data, t_index)
    Cu_p[pos_x, pos_y, pos_z] .= view(Cu_tr_data, :, :, t_index)
end

function func2(Cu_p, pos_x, pos_y, pos_z, Cu_tr_data, t_index)
    # note: Cu_p[pos_x, pos_y, pos_z] is a getindex call here, which returns a copy
    copyto!(Cu_p[pos_x, pos_y, pos_z], view(Cu_tr_data, :, :, t_index))
end

# view of the destination region, created once outside the functions
source = view(Cu_p, pos_x, pos_y, pos_z)
function func3(source, Cu_tr_data, t_index)
    source .= view(Cu_tr_data, :, :, t_index)
end

function func4(source, Cu_tr_data, t_index)
    copyto!(source, view(Cu_tr_data, :, :, t_index))
end
@benchmark CUDA.@sync func1($Cu_p, $pos_x, $pos_y, $pos_z, $Cu_tr_data, $t_index)
@benchmark CUDA.@sync func2($Cu_p, $pos_x, $pos_y, $pos_z, $Cu_tr_data, $t_index)
@benchmark CUDA.@sync func3($source, $Cu_tr_data, $t_index)
@benchmark CUDA.@sync func4($source, $Cu_tr_data, $t_index)