How to avoid memory allocation while doing sum on a GPU?

I have been having a problem with memory allocation when using sum on a CUDA array. Is there a way to avoid it? I have pasted the relevant parts of the code below:

println("3:", CUDA.memory_status())
Cgain1_reshaped_for_sum .= sum(u_times_cth, dims=(2,))
println("4:", CUDA.memory_status())

The output of the code is given below:

3:nothing
Effective GPU memory usage: 2.43% (2.265 GiB/93.096 GiB)
Memory pool usage: 1.513 GiB (1.562 GiB reserved)
4:nothing
Effective GPU memory usage: 2.53% (2.358 GiB/93.096 GiB)
Memory pool usage: 1.597 GiB (1.656 GiB reserved)

This is not a large increase by itself, but the statements above run inside a loop, and the memory usage keeps growing. I can work around the problem by calling GC.gc() at the beginning of each iteration, but then the code becomes very slow. Is there a way to avoid memory allocation when doing a sum on a GPU? Note that Cgain1_reshaped_for_sum and u_times_cth are both preallocated CUDA arrays, so I do not understand where the allocation is coming from. I tried this on both A100 and H100 GPUs, with the same result.

Any suggestions on how to avoid allocation would be very much appreciated.

Here you’re first creating a new CuArray containing the sum, which you then copy over into Cgain1_reshaped_for_sum. Instead, you can use sum!:

julia> using CUDA

julia> x = CUDA.rand(10^6, 10^3);

julia> y = CuArray{Float32}(undef, size(x, 1), 1);

julia> CUDA.@allocated y .= sum(x, dims=2)
4000000

julia> CUDA.@allocated sum!(y, x)
0

julia> y == sum(x, dims=2)
true

Thanks, but I tried it and it does not work, because sum! does not accept a dims argument. So it seems to work only for 1-D arrays. With a 4-D array, which is what I have, I cannot sum over the second axis as I am doing now. Is there a workaround to make sum! work for multi-dimensional arrays?

Well, in my example above, x is a 2D array 🙂. The trick is to insert 1s (singleton dimensions) into the shape of the preallocated output; see the help text and the sketch below.

help?> sum!
search: sum! sum cumsum! put! dump

  sum!(r, A)

  Sum elements of A over the singleton dimensions of r, and write results to r.
(...)
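So with a preallocated output that has an explicit singleton in the dimension you want to reduce, sum! writes the result in place. A minimal sketch (y3 is a hypothetical name):

julia> x = CUDA.ones(2, 3, 4);

julia> y3 = CuArray{Float32}(undef, 2, 1, 4);  # singleton in dim 2 marks the reduced axis

julia> sum!(y3, x); y3 == sum(x, dims=2)
true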

You could also use reshape for that.

julia> x = CUDA.ones(2, 3, 4);

julia> y = CuArray{Float32}(undef, 2, 4);  # will contain sum(x, dims=2)

julia> sum!(reshape(y, (2, 1, 4)), x); y  
2×4 CuArray{Float32, 2, CUDA.DeviceMemory}:
 3.0  3.0  3.0  3.0
 3.0  3.0  3.0  3.0

Sorry about that, I did not understand the syntax of sum!. But I am surprised that it does not resolve my problem. If I run the code below:

    u_reshaped = reshape(u, (dof, cbins, ebins, rbins))
    du_reshaped = reshape(du, (dof, cbins, ebins, rbins))
    Cgain0_reshaped .= reshape(sum(u_reshaped, dims=(2,)), (dof, ebins, rbins)) .* dctharr
    println("2: ", CUDA.memory_status())
    cdirchhere_reshaped = reshape(cdirchhere, (dof, ebins, rbins))
    bkpeakhere_reshaped = reshape(bkpeakhere, (rbins, 1, 1, 1))
    ctharr_reshaped = reshape(ctharr, (1, cbins, 1, 1))
    stharr_reshaped = reshape(stharr, (1, cbins, 1, 1))
    rarr_reshaped = reshape(rarr, (1, 1, 1, rbins))
    cgrad_prefactor .= -(stharr_reshaped .* stharr_reshaped ./ rarr_reshaped) / dctharr
    #println("3:", CUDA.memory_status())
    println(size(cgrad_prefactor))
    println(size(u_times_cth), " ", size(u_reshaped), " ", size(ctharr_reshaped))
    u_times_cth .= u_reshaped .* ctharr_reshaped
    println("3:", CUDA.memory_status())
    sum!(Cgain1_reshaped_for_sum, u_times_cth)  # in-place reduction over the singleton dim 2
    println("4:", CUDA.memory_status())
    Cgain1_reshaped .= reshape(Cgain1_reshaped_for_sum, (dof, ebins, rbins)) .* dctharr

I get the output:

2: nothing
(1, 75, 1, 150)
(8, 75, 25, 150) (8, 75, 25, 150) (1, 75, 1, 1)
Effective GPU memory usage: 1.93% (1.796 GiB/93.096 GiB)
Memory pool usage: 1.184 GiB (1.219 GiB reserved)
3:nothing
Effective GPU memory usage: 1.93% (1.796 GiB/93.096 GiB)
Memory pool usage: 1.184 GiB (1.219 GiB reserved)
4:nothing
Effective GPU memory usage: 2.03% (1.890 GiB/93.096 GiB)
Memory pool usage: 1.267 GiB (1.312 GiB reserved)

There is clearly memory allocation happening between 3: and 4:, but I do not understand why.

Hmm, strange… Could you extract an MWE?
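Something like this could serve as a starting point, using the sizes from your printout (hypothetical setup; whether it allocates will depend on CUDA.jl's reduction heuristics):

using CUDA

u_times_cth = CUDA.rand(8, 75, 25, 150)  # sizes taken from the printout above
Cgain1_reshaped_for_sum = CUDA.zeros(Float32, 8, 1, 25, 150)

CUDA.@allocated sum!(Cgain1_reshaped_for_sum, u_times_cth)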

Could it be Inconsistency between CUDA.jl and Base for sum! · Issue #2506 · JuliaGPU/CUDA.jl · GitHub?

When a reduction is large and happens in multiple steps, we allocate a temporary container: CUDA.jl/src/mapreduce.jl at 689f98587c2587614e13047fbcdd417b52749edd · JuliaGPU/CUDA.jl · GitHub. This could be the source. It could be avoided by doing atomic operations from the kernel instead, or maybe we could add a call to unsafe_free! at the end of mapreducedim!, which should behave similarly to you calling GC.gc().
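For reference, a rough sketch of the atomics idea (a hypothetical kernel, not what CUDA.jl does internally): each thread adds its element straight into the preallocated output with CUDA.@atomic, so no temporary container is needed. Note that the summation order differs from sum!, so results only match up to floating-point rounding.

using CUDA

# Hypothetical: sum x over dim 2 into a preallocated, zeroed output of
# size (size(x, 1), 1, size(x, 3), size(x, 4)).
function atomic_sumdim2!(out, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        a, b, c, d = Tuple(CartesianIndices(x)[i])  # linear -> 4-D index
        CUDA.@atomic out[a, 1, c, d] += x[a, b, c, d]
    end
    return nothing
end

x = CUDA.rand(8, 75, 25, 150)
out = CUDA.zeros(Float32, 8, 1, 25, 150)
nthreads = 256
@cuda threads=nthreads blocks=cld(length(x), nthreads) atomic_sumdim2!(out, x)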

Also, it is better to use CUDA.@time to report the number and amount of allocations than to parse the diff between successive calls to CUDA.memory_status().
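For example (a sketch using the variable names from above; CUDA.@time prints the elapsed time together with the number and size of both CPU and GPU allocations):

julia> CUDA.@time sum!(Cgain1_reshaped_for_sum, u_times_cth);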