When running a kernel several times (3) I get the following exception (with -g2):
ERROR: Out of dynamic GPU memory (trying to allocate 64 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 64 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 64 bytes)
ERROR: a exception was thrown during kernel execution.
Stacktrace:
ERROR: a exception was thrown during kernel execution.
Stacktrace:
ERROR: a exception was thrown during kernel execution.
Stacktrace:
[1] gc_pool_alloc at /home/sftnight/.julia/packages/GPUCompiler/1Ajz2/src/runtime.jl:129
[1] gc_pool_alloc at /home/sftnight/.julia/packages/GPUCompiler/1Ajz2/src/runtime.jl:129
[1] gc_pool_alloc at /home/sftnight/.julia/packages/GPUCompiler/1Ajz2/src/runtime.jl:129
...
I tried running GC.gc(true); CUDA.reclaim() between executions, but it does not help: I still get the crash even though memory usage is reduced.
For context, I am using a CuArray of a Union of 3 structs, which is the maximum number of types I can use for the time being (see Limitation in Union types with CUDA.jl?).
Is there a way to get a real traceback that points to the problem?
Dynamic memory is memory allocated from within a kernel, and because of how CUDA works, that memory is lost after the kernel exits. Basically, don’t allocate within a kernel. Support for it only exists to cover some limited cases where we need to allocate an exception object before throwing it.
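To make the advice concrete, here is a minimal sketch (the kernel names and shapes are hypothetical, not from the question) of the kind of in-kernel allocation that hits the dynamic pool, next to an allocation-free alternative:

```julia
using CUDA

# Allocating version: the array literal is a heap allocation performed
# by every thread, serviced by the dynamic GPU pool (gc_pool_alloc) --
# exactly the allocations that can exhaust "dynamic GPU memory".
function bad_kernel(out)
    tmp = [1.0f0, 2.0f0]              # heap allocation inside the kernel
    out[threadIdx().x] = tmp[1] + tmp[2]
    return
end

# Allocation-free version: plain scalars (or a Tuple) live in registers
# and never touch the dynamic pool.
function good_kernel(out)
    a, b = 1.0f0, 2.0f0
    out[threadIdx().x] = a + b
    return
end

out = CUDA.zeros(Float32, 32)
@cuda threads=32 good_kernel(out)
```

Replacing the array literal with a Tuple (`tmp = (1.0f0, 2.0f0)`) is usually enough to make such code allocation-free.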
To find out where the allocations come from, inspect the LLVM code (@device_code_llvm) and look for calls to alloc-like functions. Escape analysis in 1.8/1.9 is going to improve this, but for now you might have to force-inline some functions or avoid passing complex objects to complex functions.
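A quick sketch of that inspection workflow, using a stand-in kernel (the name and launch configuration are placeholders, not from the question):

```julia
using CUDA

# Hypothetical kernel standing in for the one triggering the errors.
function my_kernel(out)
    out[threadIdx().x] = 0.0f0
    return
end

out = CUDA.zeros(Float32, 32)

# Print the device-side LLVM IR for this launch; search the output for
# allocation helpers such as gc_pool_alloc or gpu_malloc to locate the
# code that allocates inside the kernel.
@device_code_llvm @cuda threads=32 my_kernel(out)
```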
Thank you very much. I guess some allocations (i.e. temporary objects) have sneaked into the code that later runs in the kernel. I’ll follow your suggestion.
Check the LLVM IR (@device_code_llvm dump_module=true ...) and look for gpu_malloc calls. They can happen when an MArray allocation wasn’t properly optimized away by Julia. Mutable StaticArrays are somewhat problematic in that they rely on a Julia optimization kicking in, which doesn’t always happen (as observed here). If you want to avoid running into this, use SArray with Base.setindex (i.e. the non-mutating version that returns a new object), which is less likely to hit the issue.
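The two styles side by side, as a minimal sketch (runnable on the CPU; inside a kernel only the second version reliably avoids allocations):

```julia
using StaticArrays

# MVector version: relies on the compiler eliding the mutable
# allocation, which can fail inside GPU kernels and show up as
# gpu_malloc calls in the LLVM IR.
function sum_mutable()
    m = MVector{3,Float32}(undef)
    for i in 1:3
        m[i] = Float32(i)
    end
    return sum(m)
end

# SVector version with the non-mutating Base.setindex: every "write"
# returns a fresh immutable value, so nothing needs to be heap-allocated.
function sum_immutable()
    s = SVector{3,Float32}(0, 0, 0)
    for i in 1:3
        s = Base.setindex(s, Float32(i), i)
    end
    return sum(s)
end
```

Note that Base.setindex (without the `!`) is the out-of-place variant; the rebinding `s = Base.setindex(s, ...)` is what replaces the mutation.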
SArrays also allocate if the element type is abstract, even for small isbits Unions, because the backing Tuple is exceptionally covariant in its parameters and thus cannot use Memory’s inline-element optimization. Watch out for mistakes in manually specified parameters in constructors, like SVector{2, Integer}, or inputs with abstract element types, like SVector{1}(push!([], 1)).
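One way to catch this kind of mistake at the REPL before the data ever reaches a kernel (a sketch; the variable names are illustrative):

```julia
using StaticArrays

v_bad  = SVector{2,Integer}(1, 2)  # abstract eltype: elements are boxed
v_good = SVector{2,Int}(1, 2)      # concrete eltype: fully isbits

# Checking isbits-ness distinguishes the two cases; CUDA.jl kernels
# generally require isbits arguments anyway.
isbitstype(eltype(v_bad))   # expected: false
isbitstype(eltype(v_good))  # expected: true
```

An `isbits(x)` assertion on kernel inputs makes the failure show up at the call site instead of as a cryptic allocation inside the kernel.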