Out of dynamic GPU memory?

When running a kernel several times (three times, in this case), I get the following exception (with -g2):

ERROR: Out of dynamic GPU memory (trying to allocate 64 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 64 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 64 bytes)
ERROR: a exception was thrown during kernel execution.
Stacktrace:
ERROR: a exception was thrown during kernel execution.
Stacktrace:
ERROR: a exception was thrown during kernel execution.
Stacktrace:
 [1] gc_pool_alloc at /home/sftnight/.julia/packages/GPUCompiler/1Ajz2/src/runtime.jl:129
 [1] gc_pool_alloc at /home/sftnight/.julia/packages/GPUCompiler/1Ajz2/src/runtime.jl:129
 [1] gc_pool_alloc at /home/sftnight/.julia/packages/GPUCompiler/1Ajz2/src/runtime.jl:129
...

I tried calling GC.gc(true); CUDA.reclaim() between executions, but it does not help: I still get the crash even though the memory usage is reduced.

julia> CUDA.memory_status()
Effective GPU memory usage: 1.41% (209.938 MiB/14.561 GiB)
Memory pool usage: 80.176 KiB (32.000 MiB reserved)
julia> GC.gc(true); CUDA.reclaim()
julia> CUDA.memory_status()
Effective GPU memory usage: 1.19% (177.938 MiB/14.561 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

For information, I am using a CuArray of a Union of 3 structs, which is the maximum number of types I can use for the time being (see Limitation in Union types with CUDA.jl?).
Is there a way to get a real backtrace that points to the problem?

Dynamic memory is memory allocated from within a kernel, and because of how CUDA works, that memory is lost after the kernel exits. Basically: don’t allocate within a kernel. The support for it only exists to cover some limited cases where we need to allocate an exception object before throwing it.

To find out where the allocations come from, inspect the LLVM code (@device_code_llvm) and look for calls to alloc-like functions. Escape analysis in Julia 1.8/1.9 is going to improve this, but for now you might have to force-inline some functions or avoid passing complex objects to complex functions.
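As a concrete sketch of that inspection (my_kernel! is a hypothetical kernel used only for illustration; this needs a CUDA-capable GPU to run):

```julia
using CUDA

# Hypothetical kernel, for illustration only.
function my_kernel!(out)
    i = threadIdx().x
    @inbounds out[i] = Float32(i)
    return
end

out = CUDA.zeros(Float32, 32)

# Dump the LLVM IR for the compiled kernel; search the printed output
# for allocation-like runtime calls (e.g. gpu_gc_pool_alloc) to locate
# which part of the code is allocating.
@device_code_llvm @cuda threads=32 my_kernel!(out)
```

If the IR contains calls such as gpu_gc_pool_alloc, the surrounding IR usually hints at which Julia function introduced the allocation.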


Thank you very much. I guess some allocations (i.e. temporary objects) have sneaked into the code that later runs in the kernel. I’ll follow your suggestion.

In my case it was due to the use of StaticArrays.MVector in kernels. Inlining functions used by the kernel helped eliminate the allocations.
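A minimal sketch of that kind of fix (hypothetical helper names; launching it requires a CUDA-capable GPU): force-inlining the helper lets the compiler keep the MVector in registers instead of emitting a dynamic allocation inside the kernel:

```julia
using CUDA, StaticArrays

# Without @inline, passing the MVector across this call boundary can
# defeat the compiler's analysis and force a dynamic allocation.
@inline function accumulate3!(v::MVector{3,Float32}, x::Float32)
    for i in 1:3
        @inbounds v[i] += x
    end
    return v
end

function kernel!(out)
    i = threadIdx().x
    v = MVector{3,Float32}(0f0, 0f0, 0f0)  # eliminable once everything inlines
    accumulate3!(v, Float32(i))
    @inbounds out[i] = v[1]
    return
end

out = CUDA.zeros(Float32, 32)
@cuda threads=32 kernel!(out)
```

Re-running @device_code_llvm on the kernel before and after adding @inline is a quick way to confirm the alloc-like calls are gone.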


A similar case was noted here: Regression in memory allocation optimization of a mutable StaticArray · Issue #41800 · JuliaLang/julia · GitHub, which should be fixed in the upcoming Julia 1.8.


Indeed, I am using StaticArrays.MVector in the kernel. I didn’t know any other way to get a mutable fixed-length vector.
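One allocation-free alternative, as a sketch (not from this thread): use an immutable SVector and rebuild it with the non-mutating Base.setindex method that StaticArrays provides, which returns a new SVector instead of writing in place:

```julia
using StaticArrays

v  = SVector(1.0, 2.0, 3.0)
v2 = Base.setindex(v, 10.0, 2)  # new SVector with element 2 replaced; v is unchanged
```

Since an SVector is an immutable bits type, such updates stay in registers and never hit the GPU's dynamic memory pool.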