I have some questions about what some CUDA errors mean and what I can do about them.
I’m dynamically generating code that I can successfully compile for both CPU and GPU targets (the GPU here is an A30 with 24 GiB of device memory). The generated code becomes very large. For example, one of my functions that runs fine on the GPU has:
- CUDA.memory(kernel) → (local = 210896, shared = 0, constant = 0)
- CUDA.registers(kernel) → 255
- ~39k lines of PTX code
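For reference, this is roughly how I obtain these numbers (generated_kernel! below is just a trivial stand-in for the real machine-generated function, which is far larger):

```julia
using CUDA

# Hypothetical stand-in for the real machine-generated function.
function generated_kernel!(out, x)
    i = threadIdx().x
    @inbounds out[i] = x[i]
    return nothing
end

out = CUDA.zeros(Float32, 32)
x   = CUDA.rand(Float32, 32)

# Compile without launching, then inspect the compiled kernel.
kernel = @cuda launch=false generated_kernel!(out, x)
CUDA.memory(kernel)     # → (local = ..., shared = 0, constant = 0)
CUDA.registers(kernel)  # → registers per thread
```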
The input to the kernel is very small compared to the device’s memory (well under 1 kB per element) and should barely have any effect.
Errors start to happen with the next larger functions I generate. One such function has:
- CUDA.memory(kernel) → (local = 403592, shared = 0, constant = 0)
- CUDA.registers(kernel) → 255
- ~78k lines of PTX code

An attempt to run this kernel immediately fails with:
Out of GPU memory
Effective GPU memory usage: 0.42% (101.000 MiB/23.599 GiB)
Memory pool usage: 8.250 KiB (32.000 MiB reserved)
Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:28
[2] check
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:37 [inlined]
[3] cuLaunchKernel
@ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:34 [inlined]
[4] (::CUDA.var"#966#967"{Bool, Int64, CuStream, CuFunction, CuDim3, CuDim3})(kernelParams::Vector{Ptr{Nothing}})
@ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:66
[5] macro expansion
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:33 [inlined]
...
This doesn’t make much sense to me at all. There appears to be more than enough memory available: I launched the kernel with 32 threads and 1 block, so 403592 B × 32 threads should come to only about 12 MiB, far below the available 24 GiB. Is it somehow the kernel size itself, or is there some other reason?
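To double-check the arithmetic and the actual memory state, I ran something like this (numbers taken from the outputs above, and assuming local memory is only allocated for the threads I actually launch):

```julia
# Expected local-memory footprint for my launch configuration:
local_bytes = 403592                      # per thread, from CUDA.memory(kernel)
threads, blocks = 32, 1
(local_bytes * threads * blocks) / 2^20   # ≈ 12.3 MiB

# And the device really does have plenty of free memory:
CUDA.memory_status()  # prints the "Effective GPU memory usage: ..." line
```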
Additionally, an even larger function fails differently:
- CUDA.memory(kernel) → (local = 1735680, shared = 0, constant = 0)
- CUDA.registers(kernel) → 255
- ~320k lines of PTX code
This call fails with an even less meaningful error message:
CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:30
[2] check
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/libcuda.jl:37 [inlined]
[3] cuLaunchKernel
@ ~/.julia/packages/CUDA/1kIOw/lib/utils/call.jl:34 [inlined]
[4] (::CUDA.var"#966#967"{Bool, Int64, CuStream, CuFunction, CuDim3, CuDim3})(kernelParams::Vector{Ptr{Nothing}})
@ CUDA ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:66
[5] macro expansion
@ ~/.julia/packages/CUDA/1kIOw/lib/cudadrv/execution.jl:33 [inlined]
...
Note that both of these functions compile just fine, in a minute or two (using @cuda launch=false ...); they only fail when I try to run them.
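The overall pattern is this (a sketch; generated_kernel! again stands in for the real function):

```julia
# Compilation succeeds in a minute or two...
kernel = @cuda launch=false generated_kernel!(out, x)

# ...but the launch itself is what throws: "Out of GPU memory" for the
# medium-sized kernel, ERROR_INVALID_VALUE for the largest one.
kernel(out, x; threads = 32, blocks = 1)
```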
And finally, if I use the always_inline = true option for the @cuda calls, I get a function with:
- CUDA.memory(kernel) → (local = 33144, shared = 0, constant = 0)
- CUDA.registers(kernel) → 255
- ~127k lines of PTX code
This function runs fine, even though it has considerably more lines of PTX code than the first failing one; the only metric that is clearly better is the local memory usage. While it’s nice that this works, I don’t want to use always_inline everywhere, because it also increases the code size.
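For completeness, the working variant differs only in the compilation option (same hypothetical stand-in as above):

```julia
# Compiled with aggressive inlining: local memory drops to 33144 bytes
# per thread, and the launch goes through.
kernel = @cuda launch=false always_inline=true generated_kernel!(out, x)
CUDA.memory(kernel)   # → (local = 33144, shared = 0, constant = 0)
kernel(out, x; threads = 32, blocks = 1)
```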
So my questions are: what exactly do these error messages mean, do they come from CUDA.jl or from libcuda, and are they fixable (assuming I can’t make the generated functions any smaller)?