I have been having a problem with memory allocation when using sum on a CUDA array. Is there a way to avoid it? I have pasted the relevant parts of the code below:
println("3:", CUDA.memory_status())
Cgain1_reshaped_for_sum .= sum(u_times_cth, dims=(2,))
println("4:", CUDA.memory_status())
The output of the code is given below:
3:nothing
Effective GPU memory usage: 2.43% (2.265 GiB/93.096 GiB)
Memory pool usage: 1.513 GiB (1.562 GiB reserved)
4:nothing
Effective GPU memory usage: 2.53% (2.358 GiB/93.096 GiB)
Memory pool usage: 1.597 GiB (1.656 GiB reserved)
This is not a large increase by itself, but the statements above sit inside a loop, and the allocated memory keeps growing from iteration to iteration. I can work around the problem by calling GC.gc() at the beginning of each iteration, but that makes the code very slow. Is there a way to avoid memory allocation when doing a sum on the GPU? Note that Cgain1_reshaped_for_sum and u_times_cth are both preallocated CUDA arrays, so I do not understand where the allocation is coming from. I tried this on both A100 and H100 GPUs with the same result.
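
For context, here is a minimal self-contained sketch of the pattern (the sizes and element type are made up; only the two array names above come from my actual code). I am also wondering whether an in-place reduction with sum! would avoid the temporary that sum(...; dims) appears to allocate:

using CUDA

# Hypothetical sizes and element type, for illustration only.
N1, N2, N3 = 64, 128, 32
u_times_cth             = CUDA.rand(Float32, N1, N2, N3)
Cgain1_reshaped_for_sum = CUDA.zeros(Float32, N1, 1, N3)  # singleton along dim 2

# Current pattern: sum(...; dims) seems to allocate a temporary device
# array, which is then broadcast into the preallocated destination.
Cgain1_reshaped_for_sum .= sum(u_times_cth, dims=(2,))

# Would an in-place reduction like this avoid the temporary?
# sum!(dest, src) reduces over the dimensions where dest has size 1.
sum!(Cgain1_reshaped_for_sum, u_times_cth)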
Any suggestions on how to avoid allocation would be very much appreciated.