CUDA(.jl) memory errors for very large kernels

I have some questions about what some CUDA errors mean and what I can do about them.

I’m dynamically generating code that I can successfully compile for both CPU and GPU targets (the GPU I’m using here is an A30 with 24GiB of device memory). This generated code becomes very large. For example, one of my functions that runs fine on the GPU has

  • CUDA.memory(kernel): (local = 210896, shared = 0, constant = 0),
  • CUDA.registers(kernel): 255,
  • ~39k lines of PTX code.
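
For context, this is roughly how I collect those numbers; my_generated_function and d_args are placeholders for the generated code and its device arguments:

  using CUDA

  # compile only (no launch), then inspect the resulting kernel object
  kernel = @cuda launch=false my_generated_function(d_args...)

  CUDA.memory(kernel)      # (local = ..., shared = ..., constant = ...), in bytes
  CUDA.registers(kernel)   # registers per thread

  # the PTX line counts come from dumping the generated code, e.g. with
  # CUDA.@device_code dir="devcode" @cuda launch=false my_generated_function(d_args...)
  # and counting the lines of the PTX files it writes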

The input to the kernel is very small (well under 1kB per element) in comparison to the device’s memory and should barely have any effect.

Errors start to happen with the next larger functions I can generate. For a function with:

  • CUDA.memory(kernel): (local = 403592, shared = 0, constant = 0),
  • CUDA.registers(kernel): 255,
  • ~78k lines of PTX code,

an attempt to run the kernel immediately fails with:

Out of GPU memory
Effective GPU memory usage: 0.42% (101.000 MiB/23.599 GiB)
Memory pool usage: 8.250 KiB (32.000 MiB reserved)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:28
  [2] check
    @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:37 [inlined]
  [3] cuLaunchKernel
    @ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:34 [inlined]
  [4] (::CUDA.var"#966#967"{Bool, Int64, CuStream, CuFunction, CuDim3, CuDim3})(kernelParams::Vector{Ptr{Nothing}})
    @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:66
  [5] macro expansion
    @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:33 [inlined]
...

This doesn’t make much sense to me at all. It looks like there should be more than enough memory available. I launched the kernel with 32 threads and 1 block, so 403592 B * 32 should only be about 12 MiB, far from the available 24 GiB (quick arithmetic below). Is it somehow the kernel size itself, or is there some other reason?
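
For completeness, the arithmetic behind that estimate (only numbers already quoted above):

  threads, blocks = 32, 1
  local_bytes = 403_592                    # CUDA.memory(kernel).local, per thread

  local_bytes * threads * blocks / 2^20    # ≈ 12.3 MiB for the threads I actually launch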

Additionally, an even larger function fails differently:

  • CUDA.memory(kernel): (local = 1735680, shared = 0, constant = 0),
  • CUDA.registers(kernel): 255,
  • ~320k lines of PTX code.

This call fails with an even less meaningful error message:

CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:30
  [2] check
    @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:37 [inlined]
  [3] cuLaunchKernel
    @ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:34 [inlined]
  [4] (::CUDA.var"#966#967"{Bool, Int64, CuStream, CuFunction, CuDim3, CuDim3})(kernelParams::Vector{Ptr{Nothing}})
    @ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:66
  [5] macro expansion
    @ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:33 [inlined]
...

Note that both of these functions compile just fine, in a minute or two (using @cuda launch=false ...). They only fail when I try to run them.
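
In other words, the two steps look something like this, with f and d_args again standing in for the generated function and its arguments:

  kernel = @cuda launch=false f(d_args...)   # compiles fine, in a minute or two
  kernel(d_args...; threads=32, blocks=1)    # the errors above appear here, at launch time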

And finally, if I use the always_inline = true option for the @cuda calls, I get a function:

  • CUDA.memory(kernel): (local = 33144, shared = 0, constant = 0),
  • CUDA.registers(kernel): 255,
  • ~127k lines of PTX code.

This function runs fine, even though it has substantially more lines of code than the first failing one. The only metric that is way better is the local memory usage. While it’s nice that this works, I don’t want to be using always_inline everywhere because it also increases the code size.
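
For reference, that variant just adds the keyword to the @cuda call, roughly like this (placeholder names again):

  kernel = @cuda launch=false always_inline=true f(d_args...)
  CUDA.memory(kernel).local   # drops to 33144 bytes per thread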

So the question is, what exactly do these error messages mean, do they come from CUDA.jl or from libcuda, and are they fixable (assuming I can’t make the function any smaller)?

The errors seem a bit misleading.
I’d assume you may be running out of shared memory or registers.
Have you checked that?

If there’s a way you can provide an MWE, it’d be helpful.

I’m not explicitly using any shared memory, and the CUDA.memory output suggests it doesn’t magically use any on its own, either. The register count is also always at 255, so that seems fine.

I can’t really provide an MWE since the minimum is very large, and the Julia functions I’m generating depend on a bunch of complicated in-development things. So I’m not sure what I would provide… I doubt the PTX code is very helpful; it is essentially just thousands of sequential function calls, no loops and almost no ifs.

I should also mention that none of the PTX code uses alloc anywhere.

If you can share CUDA.@device_code dir="./devcode" @cuda launch=false kernel(...) output, that may help.

I have zipped and uploaded the output here: Proton Drive

The errors aren’t misleading in the sense that this is what the CUDA API returns, but you’re just generating very atypical kernel code that exhausts the amount of memory registers can be spilled to, resulting in an “out of memory” error that doesn’t correspond to the usual exhaustion of available device memory.

You should try to change your code generation so that it doesn’t use that many registers and doesn’t spill that much, because regardless of whether you get this to compile, the kernel is expected to execute extremely slowly: you’ll be getting very low occupancy numbers. (To be fair, there are niche cases where extremely low-occupancy kernels can perform well, but those are few and far between.)
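
A back-of-the-envelope sketch of how spilled local memory could outgrow device memory even for a tiny launch (the sizing model here, provisioning local memory for every thread that could be resident, is an assumption on my part, as are the A30 figures of 56 SMs and 2048 threads per SM):

  local_bytes    = 403_592    # per-thread local memory of the first failing kernel
  sms            = 56         # assumed SM count for an A30
  threads_per_sm = 2_048      # assumed maximum resident threads per SM on sm_80

  # if local memory were provisioned for every potentially resident thread:
  local_bytes * sms * threads_per_sm / 2^30   # ≈ 43 GiB, well past the 23.6 GiB reported free

  # occupancy itself can be inspected via the launch configuration CUDA.jl suggests,
  # e.g. config = CUDA.launch_configuration(kernel.fun)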

Thanks a lot for the answer. That makes more sense to me now.

The errors aren’t misleading

I disagree; an error is misleading when it doesn’t tell you what actually went wrong. It’s just not CUDA.jl’s fault in this case.

So if I understand correctly, you’re saying that there is a specific and independent limit for the amount of register spills that can be handled. Now I have new questions about this. Is there a (searchable) name for this limit? Is it device-specific or just a CUDA limitation? Also, what is the limit?

You should try and change your code generation as to not use that many registers, and not spill that much

I will do that either way; I just also wanted to understand the limits I’m running into. On that note, the second kernel’s failure with the ERROR_INVALID_VALUE CUDA error still seems to be a separate problem. Is that just yet another limit, this time on the kernel’s actual code size?

I’m generally having trouble finding this sort of information myself in the CUDA documentation or device specifications; is there a good resource to consult for these sorts of issues?

The CUDA C++ Programming Guide lists “Maximum amount of local memory per thread: 512 KB”.
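
For comparison with the three kernels reported above (assuming the table’s 512 KB means 512 × 1024 bytes):

  limit = 512 * 1024     # 524288 bytes per thread

  210_896   < limit      # true  - the kernel that runs fine
  403_592   < limit      # true  - the kernel that fails with "Out of GPU memory"
  1_735_680 < limit      # false - the kernel that fails with ERROR_INVALID_VALUE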

Thanks, I would not have found that in there. But now I wonder even more why the function for which CUDA.jl reports ~404 KB of local memory fails; if the limit is 512 KB, then that should run. Unless CUDA.memory doesn’t report everything that will actually be used.

Meanwhile, I have found in some NVIDIA discussion threads that the size limit for kernels is apparently 2 million instructions (for example here). I couldn’t find that info in the programming guide, though… Also, my functions are a few hundred thousand lines of PTX, not 2 million, so that should still be working in theory.

Also, sorry to keep harping on this. As I mentioned, I’m mainly interested in understanding exactly where it stops working and why, not so much in “just” making it run.

You’ll have a hard time, especially given your broad definition of “misleading”. :) CUDA has many such pitfalls (opaque error codes, undocumented behavior, etc.) when you stray off the beaten path.

I wouldn’t assume lines of PTX and instructions map 1:1, though; I wonder how we could count the latter.
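
One rough way might be to count statements in the dumped code; a sketch, assuming that lines ending in ';' are a usable proxy for instructions (the path below is a placeholder for one of the PTX files in the devcode dump shared earlier):

  # every PTX statement (and some directives) is terminated by ';',
  # so this slightly over-approximates the instruction count
  ptx_lines = readlines("devcode/kernel.ptx")   # placeholder path
  n_instructions = count(l -> endswith(strip(l), ";"), ptx_lines)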