CUDA(.jl) memory errors for very large kernels

I have some questions about what some CUDA errors mean and what I can do about them.

I’m dynamically generating code that I can successfully compile for both CPU and GPU targets (the GPU I’m using here is an A30 with 24GiB of device memory). This generated code becomes very large. For example, one of my functions that runs fine on the GPU has

  • CUDA.memory(kernel) = (local = 210896, shared = 0, constant = 0),
  • CUDA.registers(kernel) = 255,
  • ~39k lines of PTX code.

The input to the kernel is very small (well under 1kB per element) in comparison to the device’s memory and should barely have any effect.
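
(For reference, these numbers come from compiling without launching and then inspecting the kernel object, roughly like this. This is a minimal sketch with a stand-in kernel, not my actual generated code:)

using CUDA

# stand-in kernel; my real generated functions are tens of thousands of PTX lines
function dummy_kernel(out)
    out[1] = 1f0
    return nothing
end

out = CUDA.zeros(Float32, 1)
kernel = @cuda launch=false dummy_kernel(out)   # compile only, don't launch

@show CUDA.memory(kernel)     # local/shared/constant memory usage in bytes
@show CUDA.registers(kernel)  # registers per thread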

Errors start to happen at the next larger functions I can generate. For a function:

  • CUDA.memory(kernel) = (local = 403592, shared = 0, constant = 0),
  • CUDA.registers(kernel) = 255,
  • ~78k lines of PTX code,

an attempt to run the kernel immediately fails with:

Out of GPU memory
Effective GPU memory usage: 0.42% (101.000 MiB/23.599 GiB)
Memory pool usage: 8.250 KiB (32.000 MiB reserved)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:28
  [2] check
    @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:37 [inlined]
  [3] cuLaunchKernel
    @ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:34 [inlined]
  [4] (::CUDA.var"#966#967"{Bool, Int64, CuStream, CuFunction, CuDim3, CuDim3})(kernelParams::Vector{Ptr{Nothing}})
    @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:66
  [5] macro expansion
    @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:33 [inlined]
...

This doesn’t make much sense to me at all; it looks like there should be more than enough memory available. I started the kernel with 32 threads and 1 block, so 403592 B * 32 should only be about 12 MiB, far from the available 24 GiB. Is it somehow the kernel size itself, or is there some other reason?
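
(The arithmetic I'm doing there, spelled out:)

# expected local-memory footprint of the launch: 32 threads in 1 block
403592 * 32 / 2^20   # ≈ 12.3 MiB, nowhere near the 24 GiB of device memory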

Additionally, an even larger function fails differently:

  • CUDA.memory(kernel) = (local = 1735680, shared = 0, constant = 0),
  • CUDA.registers(kernel) = 255,
  • ~320k lines of PTX code.

This call fails with an even less meaningful error message:

CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:30
  [2] check
    @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:37 [inlined]
  [3] cuLaunchKernel
    @ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:34 [inlined]
  [4] (::CUDA.var"#966#967"{Bool, Int64, CuStream, CuFunction, CuDim3, CuDim3})(kernelParams::Vector{Ptr{Nothing}})
    @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:66
  [5] macro expansion
    @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:33 [inlined]
...

Note that both these functions compile just fine, in a minute or two (using @cuda launch=false ...). They only fail when trying to run them.

And finally, if I use the always_inline = true option for the @cuda calls, I get a function:

  • CUDA.memory(kernel) = (local = 33144, shared = 0, constant = 0),
  • CUDA.registers(kernel) = 255,
  • ~127k lines of PTX code.

This function runs fine, even though it has substantially more lines of code than the first failing one. The only metric that is way better is the local memory usage. While it’s nice that this works, I don’t want to be using always_inline everywhere because it also increases the code size.
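
(The only change for this variant is the flag, e.g. with the stand-in kernel from above:)

kernel_inlined = @cuda launch=false always_inline=true dummy_kernel(out)
@show CUDA.memory(kernel_inlined)
@show CUDA.registers(kernel_inlined)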

So the question is, what exactly do these error messages mean, do they come from CUDA.jl or from libcuda, and are they fixable (assuming I can’t make the function any smaller)?

The errors seem a bit misleading.
I’d assume you may be running out of shared memory or registers.
Have you checked that?

If there’s a way you can provide an MWE, it’d be helpful.

I’m not explicitly using any shared memory, and the CUDA.memory output suggests it doesn’t magically use any on its own, either. The register count is also always at 255, so that seems fine.

I can’t really provide an MWE since the minimum is very large, and the Julia functions I’m generating depend on a bunch of complicated in-development things. So I’m not sure what I would provide… I doubt the PTX code is very helpful. It is essentially just thousands of sequential function calls, no loops and almost no ifs.

I should also mention that none of the PTX code uses alloc anywhere.

If you can share CUDA.@device_code dir="./devcode" @cuda launch=false kernel(...) output, that may help.

I have zipped and uploaded the output here: Proton Drive

The errors aren’t misleading in that this is what the CUDA API returns, but you’re generating very atypical kernel code that exhausts the amount of memory that can be spilled, resulting in an “out of memory” error that doesn’t correspond to the usual exhaustion of available device memory. You should try to change your code generation so that it doesn’t use that many registers and doesn’t spill that much, because regardless of whether you get this to compile, the kernel is expected to execute extremely slowly, since you’ll be getting very low occupancy numbers. (TBF, there are niche cases where extremely low-occupancy kernels can perform well, but those are few and far between.)
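
You can see the occupancy impact directly, e.g. by asking for the suggested launch configuration of a compiled-but-not-launched kernel (a sketch; kernel here is whatever @cuda launch=false returned):

config = launch_configuration(kernel.fun)
@show config.blocks config.threads  # with 255 registers per thread, register pressure caps the suggested block size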

Thanks a lot for the answer. That makes more sense to me now.

The errors aren’t misleading

I disagree; an error is misleading when it doesn’t tell you what actually went wrong. It’s just not CUDA.jl’s fault in this case.

So if I understand correctly, you’re saying that there is a specific and independent limit for the amount of register spills that can be handled. Now I have new questions about this. Is there a (searchable) name for this limit? Is it device-specific or just a CUDA limitation? Also what is the limit?

You should try to change your code generation so that it doesn’t use that many registers and doesn’t spill that much

I will do that either way; I just also wanted to understand the limits that I’m running into. On that note, the problem with the second failing kernel, the ERROR_INVALID_VALUE CUDA error, is still a different one. Is this then just yet another limit, this time on the actual kernel’s code size itself?

I’m generally having trouble finding this sort of information myself in the CUDA documentation or device specifications. Is there a good resource to consult for these sorts of issues?

The CUDA C++ Programming Guide shows “Maximum amount of local memory per thread: 512 KB”.

Thanks, I would not have found that in there. But now I wonder even more why the function for which CUDA.jl reports 402 KB of local memory fails; if the limit is 512 KB, then that should run. Unless the memory call doesn’t report everything that will actually be used.

Meanwhile, I have found in some NVIDIA discussion threads that the size limit for kernels is apparently 2 million instructions (for example here). I couldn’t find that info in the programming guide, though… Also, my functions are a few hundred thousand lines of code, not 2 million, so that should still work in theory.

Also, sorry that I keep harping on this. As I mentioned, I’m mainly interested in understanding exactly where it stops working and why, not so much in “just” making it run.

You’ll have a hard time, especially given your broad definition of “misleading” :slight_smile: CUDA has many such pitfalls (opaque error codes, undocumented behavior, etc.) when you stray off the beaten path.

I wouldn’t assume lines of code and instructions are 1:1; I wonder how we could count the latter.
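
Maybe something like this gives a rough count (an untested sketch; kernel_fn and args are placeholders, and not every SASS line is an instruction):

buf = IOBuffer()
CUDA.@device_code_sass io=buf @cuda launch=false kernel_fn(args...)
println(count(==('\n'), String(take!(buf))), " lines of SASS")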

Perhaps the stack frame size limit is what’s being hit here?
Please see the following for more information: What is the maximum CUDA Stack frame size per Kerenl. - CUDA Programming and Performance - NVIDIA Developer Forums

For the A30, stack frame size available per thread = 24GB/56/1024 = 439 KB.
This is approximate but seems very close to your 402 KB.
Hopefully I’ve understood correctly!
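
(Spelling out that arithmetic; I’m reading the 56 as the A30’s SM count and the 1024 as a per-SM thread count:)

24 * 2^30 / 56 / 1024 / 1024   # ≈ 439 KB per thread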

I’m wondering what techniques you used to mitigate register spilling, if you managed to do so.

Hey, sorry I was on vacation for a while and only just got back.
I did quite a bit of kernel inspection with ncu-ui, but sadly, it also doesn’t give more helpful error messages about what went wrong.

For the A30, stack frame size available per thread = 24GB/56/1024 = 439 KB.
This is approximate but seems very close to your 402 KB.

I did find out that local memory is really just global memory that gets allocated to specific threads, so your comment makes a lot of sense. The remaining few KB could be things like the code itself, which also gets large and might not be reported as part of the kernel’s local memory consumption. But I’m not sure if that is actually stored in the same place.

I’m wondering what techniques you used to mitigate register spilling, if you managed to do so.

I’m not sure what you mean by mitigating it. My code has a lot of register spilling just from the way it’s generated. I don’t think there’s any way to “mitigate” this other than to generate different code, i.e., write a different kernel that uses less local memory.
With CUDA.jl, you can give @cuda a maxregs=N parameter. By default, this was 255 for the A30, which limited the warps per SM to 2 and thereby the occupancy to 12.5%. I tried setting it to other values like 128 or 64, which resulted in higher occupancy but slower execution for me. Generally, always_inline=true seems like an extremely powerful tool; I’m not sure if there are cases where it shouldn’t be used.
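
(For reference, the maxregs experiment looked roughly like this; my_kernel and args stand in for my generated code:)

k = @cuda launch=false maxregs=64 my_kernel(args...)
@show CUDA.registers(k)            # now capped at 64
@show launch_configuration(k.fun)  # more warps fit per SM, but execution got slower for me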

I’m wondering why so much local memory is being allocated. Are you perhaps using a lot of StaticArrays?
For example, if I use the following toy example, 22 registers are used plus 1024 bytes of local memory:

using CUDA, StaticArrays

rates = SVector{85,Float32}(0.00129,0.00143,0.00154,0.00160,0.00166,0.00168,0.00167,0.00164,0.00161,0.00157,0.00152,0.00148,
    0.00146,0.00144,0.00144,0.00144,0.00147,0.00150,0.00155,0.00161,0.00169,0.00177,0.00188,0.00200,
    0.00214,0.00229,0.00247,0.00265,0.00286,0.00307,0.00332,0.00359,0.00388,0.00419,0.00454,0.00491,
    0.00535,0.00586,0.00643,0.00709,0.00782,0.00863,0.00949,0.01042,0.01147,0.01264,0.01394,0.01542,
    0.01711,0.01902,0.02113,0.02340,0.02586,0.02850,0.03138,0.03463,0.03831,0.04256,0.04744,0.05292,
    0.05880,0.06506,0.07164,0.07847,0.08572,0.09367,0.10252,0.11252,0.12379,0.13611,0.14920,0.16280,
    0.17679,0.19089,0.20529,0.22019,0.23584,0.25275,0.27163,0.29565,0.32996,0.38455,0.48020,0.65798,
    1.00000)
rates2 = SVector{85,Float32}(0.00129,0.00143,0.00154,0.00160,0.00166,0.00168,0.00167,0.00164,0.00161,0.00157,0.00152,0.00148,
    0.00146,0.00144,0.00144,0.00144,0.00147,0.00150,0.00155,0.00161,0.00169,0.00177,0.00188,0.00200,
    0.00214,0.00229,0.00247,0.00265,0.00286,0.00307,0.00332,0.00359,0.00388,0.00419,0.00454,0.00491,
    0.00535,0.00586,0.00643,0.00709,0.00782,0.00863,0.00949,0.01042,0.01147,0.01264,0.01394,0.01542,
    0.01711,0.01902,0.02113,0.02340,0.02586,0.02850,0.03138,0.03463,0.03831,0.04256,0.04744,0.05292,
    0.05880,0.06506,0.07164,0.07847,0.08572,0.09367,0.10252,0.11252,0.12379,0.13611,0.14920,0.16280,
    0.17679,0.19089,0.20529,0.22019,0.23584,0.25275,0.27163,0.29565,0.32996,0.38455,0.48020,0.65798,
    1.00000)
rates3 = SVector{85,Float32}(0.00129,0.00143,0.00154,0.00160,0.00166,0.00168,0.00167,0.00164,0.00161,0.00157,0.00152,0.00148,
    0.00146,0.00144,0.00144,0.00144,0.00147,0.00150,0.00155,0.00161,0.00169,0.00177,0.00188,0.00200,
    0.00214,0.00229,0.00247,0.00265,0.00286,0.00307,0.00332,0.00359,0.00388,0.00419,0.00454,0.00491,
    0.00535,0.00586,0.00643,0.00709,0.00782,0.00863,0.00949,0.01042,0.01147,0.01264,0.01394,0.01542,
    0.01711,0.01902,0.02113,0.02340,0.02586,0.02850,0.03138,0.03463,0.03831,0.04256,0.04744,0.05292,
    0.05880,0.06506,0.07164,0.07847,0.08572,0.09367,0.10252,0.11252,0.12379,0.13611,0.14920,0.16280,
    0.17679,0.19089,0.20529,0.22019,0.23584,0.25275,0.27163,0.29565,0.32996,0.38455,0.48020,0.65798,
    1.00000)

rates_gpu = CUDA.cu(rates)
rates2_gpu = CUDA.cu(rates2)
rates3_gpu = CUDA.cu(rates3)

@inline function example2(age, rates, rates2, rates3) 
    return rates[age-14]+0.01f0*rates2[age-14]+0.0025f0*rates3[age-14]
end
function kernel_test2(age0, age1, rates, rates1, rates2, qres)

    sumqx = zero(Float32)
    for age=age0:age1
        sumqx += example2(age, rates, rates1, rates2)
    end
    qres[1] = sumqx

    nothing
end

qres = CUDA.zeros(Float32, 1)
cudakernel0 = @cuda launch=false kernel_test2(20, 90, rates_gpu, rates2_gpu, rates3_gpu, qres)
config = launch_configuration(cudakernel0.fun)
@show CUDA.registers(cudakernel0)
@show CUDA.memory(cudakernel0)

If normal vectors are used, then 24 registers are used but no local memory. The kernel is approx. 30x faster:

rates = Vector{Float32}([0.00129,0.00143,0.00154,0.00160,0.00166,0.00168,0.00167,0.00164,0.00161,0.00157,0.00152,0.00148,
    0.00146,0.00144,0.00144,0.00144,0.00147,0.00150,0.00155,0.00161,0.00169,0.00177,0.00188,0.00200,
    0.00214,0.00229,0.00247,0.00265,0.00286,0.00307,0.00332,0.00359,0.00388,0.00419,0.00454,0.00491,
    0.00535,0.00586,0.00643,0.00709,0.00782,0.00863,0.00949,0.01042,0.01147,0.01264,0.01394,0.01542,
    0.01711,0.01902,0.02113,0.02340,0.02586,0.02850,0.03138,0.03463,0.03831,0.04256,0.04744,0.05292,
    0.05880,0.06506,0.07164,0.07847,0.08572,0.09367,0.10252,0.11252,0.12379,0.13611,0.14920,0.16280,
    0.17679,0.19089,0.20529,0.22019,0.23584,0.25275,0.27163,0.29565,0.32996,0.38455,0.48020,0.65798,
    1.00000])
rates2 = Vector{Float32}([0.00129,0.00143,0.00154,0.00160,0.00166,0.00168,0.00167,0.00164,0.00161,0.00157,0.00152,0.00148,
    0.00146,0.00144,0.00144,0.00144,0.00147,0.00150,0.00155,0.00161,0.00169,0.00177,0.00188,0.00200,
    0.00214,0.00229,0.00247,0.00265,0.00286,0.00307,0.00332,0.00359,0.00388,0.00419,0.00454,0.00491,
    0.00535,0.00586,0.00643,0.00709,0.00782,0.00863,0.00949,0.01042,0.01147,0.01264,0.01394,0.01542,
    0.01711,0.01902,0.02113,0.02340,0.02586,0.02850,0.03138,0.03463,0.03831,0.04256,0.04744,0.05292,
    0.05880,0.06506,0.07164,0.07847,0.08572,0.09367,0.10252,0.11252,0.12379,0.13611,0.14920,0.16280,
    0.17679,0.19089,0.20529,0.22019,0.23584,0.25275,0.27163,0.29565,0.32996,0.38455,0.48020,0.65798,
    1.00000])
rates3 = Vector{Float32}([0.00129,0.00143,0.00154,0.00160,0.00166,0.00168,0.00167,0.00164,0.00161,0.00157,0.00152,0.00148,
    0.00146,0.00144,0.00144,0.00144,0.00147,0.00150,0.00155,0.00161,0.00169,0.00177,0.00188,0.00200,
    0.00214,0.00229,0.00247,0.00265,0.00286,0.00307,0.00332,0.00359,0.00388,0.00419,0.00454,0.00491,
    0.00535,0.00586,0.00643,0.00709,0.00782,0.00863,0.00949,0.01042,0.01147,0.01264,0.01394,0.01542,
    0.01711,0.01902,0.02113,0.02340,0.02586,0.02850,0.03138,0.03463,0.03831,0.04256,0.04744,0.05292,
    0.05880,0.06506,0.07164,0.07847,0.08572,0.09367,0.10252,0.11252,0.12379,0.13611,0.14920,0.16280,
    0.17679,0.19089,0.20529,0.22019,0.23584,0.25275,0.27163,0.29565,0.32996,0.38455,0.48020,0.65798,
    1.00000])
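
(I’m assuming the rest of the snippet is unchanged from the SVector version, i.e.:)

rates_gpu = CUDA.cu(rates)
rates2_gpu = CUDA.cu(rates2)
rates3_gpu = CUDA.cu(rates3)

qres = CUDA.zeros(Float32, 1)
cudakernel1 = @cuda launch=false kernel_test2(20, 90, rates_gpu, rates2_gpu, rates3_gpu, qres)
@show CUDA.registers(cudakernel1)
@show CUDA.memory(cudakernel1)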

I note that the kernel signature for the SVector case has SArrays vs CuDeviceArrays.

Regarding general mitigation of register pressure, what I’ve looked at is using shared memory to temporarily store values that might otherwise be held in registers.
Using constant cache and splitting a monolithic kernel into multiple ones can also prove useful.
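
As a rough, hypothetical illustration of the shared-memory idea (not taken from your code):

using CUDA

# hypothetical sketch: park a per-thread value in shared memory instead of holding it
# in a register across a long stretch of computation
function staged_kernel(out)
    tid = threadIdx().x
    scratch = CuStaticSharedArray(Float32, 256)  # one slot per thread of a 256-thread block
    scratch[tid] = Float32(tid)                  # value now lives in shared memory, not a register
    # ... long computation that would otherwise keep this value live in a register ...
    out[tid] = scratch[tid] * 2f0
    return nothing
end

out = CUDA.zeros(Float32, 256)
@cuda threads=256 staged_kernel(out)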

Mike

I forgot that you had supplied some code and that it all relates to QED.
So, it looks like you do make heavy use of static arrays but with many of them being only 4 elements (double and complex), so you’re not abusing things like I did in my example above.
However, if I change my example to use only 4 elements, the static arrays still spill over into local memory.
I think that you can prevent this if you use the Array constructor on the static arrays before converting to CuArrays. Unfortunately, I think you may need to also change every function signature so that it can handle CuDeviceArrays.
Though I wonder if there is a better solution that I’m missing here.

Mike

I’m a little confused about your suggestion. Are you saying that most of the register spilling might be a result of the input type that is initially copied to the GPU (the PhaseSpacePoint in my case) having SVectors inside? This would be baffling to me since I thought SVectors should essentially behave like a struct or a Tuple with all elements of one type. But maybe I’m fundamentally misunderstanding something here?
In any case, it’s unfortunately not easy for me to switch everything to a non-SVector based approach, since some of the lower functions rely on them for dispatch.

Regarding general mitigation of register pressure, what I’ve looked at is using shared memory to temporarily store values that might otherwise be held in registers.
Using constant cache and splitting a monolithic kernel into multiple ones can also prove useful.

Splitting my code into many smaller kernels is definitely the way to go here and has been the plan for a while. I expect that I can replace one of the current kernel calls by maybe 10-20 kernel calls in sequence, maybe 2 or 3 independently in parallel.
But since I’m not just coding this by hand but generating the kernels (and want to keep doing that so it stays generic), this is pretty difficult to implement.

Are you saying that most of the register spilling might be a result of the input type that is initially copied to the GPU (the PhaseSpacePoint in my case) having SVectors inside?

Maybe I’m being a bit presumptuous as to what you’re trying to do, and you’re actually OK with the amount of “local memory” being used, even though it may be “slow”. It may still be quick enough for your purposes. You’ve been using Nsight Compute, so you’d have seen the memory statistics (e.g. L1 and L2 cache hit rates) and been able to assess whether performance is good enough for you.

I was trying to show that static vectors do not get allocated to global memory, so I wondered if that could be an issue that you might want to resolve. One way to do this is to convert them to “real” arrays, but that may not be easy or possible in your case.
What I’ve also noticed is that you might be able to avoid this allocation if the compiler can resolve the indexing of static vectors at compile time. Kernel arguments can instead end up in the constant address space (bank 0), and I believe you have 32 KB available for that. There is also an additional 64 KB for user constants that you might like to exploit, along with shared memory (though that’s a little trickier). Again, I’m not saying that any of this will be easy for you to do, just laying out some potential options.

In compute__308885b0_ea17_11ef_1fa7_41e4c96af9f8_1.opt.ll you’ll see thousands of alloca instructions starting from e.g. line 6887. This is the compiler allocating memory on a per-thread local memory stack. It’s as if you allocated automatic variables in C.
Maybe that’s ok for you but, if not, you might think about changing your objects to be more gpu/gmem-friendly. For example, by using a struct of arrays rather than a struct of static vectors or an array of structs. I’m presuming you know all of this.

This would be baffling to me since I thought SVectors should essentially behave like a struct or a Tuple with all elements of one type. But maybe I’m fundamentally misunderstanding something here?
I think your understanding is correct, and it’s precisely because of this that they are not allocated to GMEM. I’m just pointing it out in case it was an easy-ish thing for you to change.

Good luck!

I played around a little more and made the following example:

using CUDA
using QEDcore

function kernel(in1, in2, out, n)
    id = (blockIdx().x - one(Int32)) * blockDim().x + threadIdx().x
    if (id > n)
        return
    end

    @inbounds A = in1[id]
    @inbounds B = in2[id]
    t = A * B
    @inbounds out[id] = t
    return nothing
end

N = Int32(16)
T = Float32

in1 = CUDA.cu(rand(BiSpinor{T}, N))
in2 = CUDA.cu(rand(AdjointBiSpinor{T}, N))
out = CUDA.zeros(DiracMatrix{T}, N)

K = @cuda launch = false always_inline = true kernel(in1, in2, out, N)
@show CUDA.memory(K)
@show CUDA.registers(K)

(note that this requires the current dev version of QEDcore and QEDbase if you want to try it yourself)

This uses the same arithmetic and objects that are used in my generated kernels. With this example, I get 0 local memory and 32 registers used, which seems reasonable. This still uses SVectors internally to dispatch the correct multiplication methods.

I have found that always_inline makes a big difference (again). When it is disabled, the kernel uses local memory again. Also, @inbounds is very important; without it, the kernel uses local memory and more registers.

I already considered both of these things in my original example, though, which leads me to believe that the remaining local memory usage is simply from spilled registers and that the amount of memory really is required, at least when compiling it as one monolithic kernel as I’m currently doing.

rand(T, n) returns an array, so in1 and in2 are both CuArrays and so live in global device memory.
It’s not ideal from a memory coalescing point of view that they are arrays of structs rather than a struct of arrays, but it still avoids local memory.
The same holds for out.
If you use @device_code_ptx then you’ll see that all loads and stores are from/to global memory.
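
For example, with the kernel and arrays from your snippet:

# prints the PTX; look for ld.global/st.global vs ld.local/st.local
CUDA.@device_code_ptx @cuda launch=false always_inline=true kernel(in1, in2, out, N)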

What I was trying to say in the last message was that, if a kernel input uses static vectors, then it appears that the compiler tries to fit everything into registers and then spills into local memory if the number of registers is insufficient. They don’t get allocated to the global address space, as far as I can tell. For example, using

#in1 = CUDA.cu(rand(QEDcore.BiSpinor{T}, N))
in1 = CUDA.cu(SVector{16, QEDcore.BiSpinor{T}}(rand(QEDcore.BiSpinor{T}, 16)))

results in 73 registers used and 256 bytes of local memory.

The trivial “fix” in this case is to use the Array constructor:

in1 = CUDA.cu(Array(SVector{16, QEDcore.BiSpinor{T}}(rand(QEDcore.BiSpinor{T}, 16))))

In the original example you gave, the kernel used a more complicated object (PhaseSpacePoint?). It may not be an SVector, but it could be a plain struct and not for example a struct of arrays. These won’t be automatically transformed into objects that use global memory. Can you perhaps give another example that uses PhaseSpacePoint in a simple manner?

An example that uses a struct of arrays would be:

using StructArrays

tqs_cpu = StructArray(spn=rand(QEDcore.BiSpinor{T}, N), bspn=rand(QEDcore.AdjointBiSpinor{T}, N))
tqs_gpu = CUDA.cu(tqs_cpu) 

function kernel(in, out, n)
    id = (blockIdx().x - one(Int32)) * blockDim().x + threadIdx().x
    if (id > n)
        return
    end

    @inbounds A = in.spn[id] #in1[id]
    @inbounds B = in.bspn[id] #in2[id]
    t = A * B
    @inbounds out[id] = t
    return nothing
end

I should also say that I’m struggling to see where you have the massive parallelization for which GPUs would potentially be beneficial. For example, the StructArrays I use in one project have arrays with 190k elements, and thousands of simulations are performed on each of these. Or is this just a step towards solving another problem where GPUs would be beneficial? Or have I missed the point :confused:?

Also, if you want to disable bounds checking, you can always start the Julia process with --check-bounds=no.
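
(i.e., from the shell; the script name is just a placeholder:)

julia --check-bounds=no my_gpu_script.jl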