The PhaseSpacePoint object is essentially a collection of SFourMomentum objects, stored in tuples, plus some singletons for meta information. SFourMomentum itself is an object with 4 simple data members, but it behaves like a 4-element SVector. The important point here is that it is entirely isbits.
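For illustration, a minimal sketch of the layout (field and type names are simplified here, not the actual definitions):

struct SFourMomentum
    E::Float64
    px::Float64
    py::Float64
    pz::Float64
end

struct PhaseSpacePoint{I<:Tuple, O<:Tuple}
    in_momenta::I    # one SFourMomentum per incoming particle
    out_momenta::O   # one SFourMomentum per outgoing particle
end

isbitstype(SFourMomentum)  # true: plain data, so it can be stored directly in a CuVector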
The example I gave should match the original kernel fairly closely, in that it pushes objects to the GPU (allocated in a Vector/CuVector, just like my original example) which themselves contain SVector-like objects, in the example case BiSpinor/AdjointBiSpinor/DiracMatrix. I didn’t push SVectors of input to the GPU in either case.
The problem isn’t evaluating the kernel I originally posted once; it has to be evaluated millions or billions of times to be useful (in the end it will be used for many-dimensional Monte Carlo integration). So currently I have one big monolithic kernel that evaluates f(psp::PhaseSpacePoint) = y::Float, and I call it on some large [Cu]Vector of PhaseSpacePoints. As we’ve seen, f can get arbitrarily big depending on the scattering process I’m considering, so I’m looking for ways to break the kernel into smaller parts and parallelize inside f, not just across it.
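Schematically, the current setup is just a broadcast over the vector of points (a sketch; f and the Float32 data are stand-ins for the real matrix-element code and PhaseSpacePoints):

using CUDA

f(psp) = psp^2 + 1f0                  # stand-in for the monolithic matrix-element kernel
psps = CUDA.rand(Float32, 1_000_000)  # stand-in for the CuVector of PhaseSpacePoints
results = f.(psps)                    # one GPU thread per point; no parallelism inside f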
I’m not sure if this is the way to go, since it would introduce communication between threads (and a lot more code complexity), which is usually something you want to avoid on GPUs.
I suspect that all of the local loads/stores are due to allocations that haven’t been optimized away. In your example, the product of a BiSpinor and an AdjointBiSpinor is a DiracMatrix: the function _mul returns a new DiracMatrix(BS * transpose(aBS)).
The (roughly) CPU-equivalent code for your example is:
function testmul(in1, in2, out, n)
    @inbounds A = in1[n]
    @inbounds B = in2[n]
    @inbounds out[n] = A * B
end
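For completeness, the GPU version would be written and launched roughly like this (a sketch; Float32 arrays stand in for the CuArrays of BiSpinor/AdjointBiSpinor/DiracMatrix):

using CUDA

function testmul_kernel(in1, in2, out)
    # one thread per element of the output array
    n = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if n <= length(out)
        @inbounds out[n] = in1[n] * in2[n]
    end
    return nothing
end

N = 1024
in1, in2 = CUDA.rand(Float32, N), CUDA.rand(Float32, N)
out = CUDA.zeros(Float32, N)
@cuda threads=256 blocks=cld(N, 256) testmul_kernel(in1, in2, out)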
If one uses Cthulhu.jl with inline_cost=true, one sees that the inline cost for _mul is 128. This exceeds the standard cost threshold of 100, so the function call isn’t automatically inlined; see Inlining 101.
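To reproduce, something along these lines (a sketch with plain Float64 arrays; the cost of 128 shows up when descending with the actual spinor-typed arrays, and the inline_cost keyword can also be toggled with i in Cthulhu’s interactive menu):

using Cthulhu

in1, in2, out = rand(3), rand(3), zeros(3)  # any element type works for the demo
@descend inline_cost=true testmul(in1, in2, out, 1)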
I believe that when you use always_inline=true, or manually annotate a method with @inline, the inlining cost threshold is increased 20x, i.e. inline_cost_threshold becomes 2000 instead of 100 (for each method). So your simple case results in inlining, but that may not hold for larger cases, especially if there are hidden dynamic dispatches. Certain statements are hit with an inline_nonleaf_penalty=1000, for example. Cf. julia/base/compiler/optimize.jl at a0ff94433010eeb5d1f3297fb4fffdf4a5ef5f7e · JuliaLang/julia · GitHub
So I’m wondering if a better strategy for you might be to manually annotate more of your methods with @inline. For example, you can avoid requiring always_inline=true in your simple example if you annotate the _mul functions within multiplication.jl, as in the sketch below. You can then slowly increase the complexity of your models to discover where inlining starts to fail again, and take appropriate action.
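Concretely, something along these lines (a sketch; SVectors stand in for the actual BiSpinor/AdjointBiSpinor types in multiplication.jl):

using StaticArrays

# @inline raises the cost threshold for this method, so the matrix temporary
# from the outer product can be optimized away at the call site
@inline _mul(BS::SVector{4,ComplexF64}, aBS::SVector{4,ComplexF64}) = BS * transpose(aBS)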
For me, this error often happens when temporary variables are not released immediately, especially when carrying out FFTs. I work around it by manually releasing the GPU memory with CUDA.unsafe_free!:
using CUDA

a = cu(rand(512, 512, 512))
# a = nothing only marks the array for GC; the memory is not freed immediately
CUDA.unsafe_free!(a)  # GPU memory is released immediately; never use a again before redefining it
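In the FFT case, the same pattern applied to the input as soon as it has been consumed (a sketch, assuming CUDA.CUFFT’s fft for CuArrays):

using CUDA, CUDA.CUFFT

a = CUDA.rand(ComplexF32, 512, 512, 512)
b = fft(a)            # allocates a new GPU array for the result
CUDA.unsafe_free!(a)  # free the input right away instead of waiting for GC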
Furthermore, I don’t see any local loads/stores for that kernel when I use @device_code_ptx without always_inline=true, as long as I @inline the _mul methods. The same holds for the SASS code.
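For anyone wanting to reproduce: with the testmul_kernel sketch from earlier in the thread, the device code can be dumped like this (@device_code_sass is the SASS counterpart of @device_code_ptx):

in1, in2, out = CUDA.rand(Float32, 1024), CUDA.rand(Float32, 1024), CUDA.zeros(Float32, 1024)
@device_code_ptx @cuda threads=256 blocks=4 testmul_kernel(in1, in2, out)
@device_code_sass @cuda threads=256 blocks=4 testmul_kernel(in1, in2, out)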