Val{N} + LinearIndices Causes Massive Compile-Time Unrolling

I’m seeing strange behavior when using @code_warntype on a function that takes a Val{N} parameter and creates a LinearIndices object with size (N, N, N). Here’s a minimal example:

function test(::Val{N}) where {N}
    LinearIndices((1:N, 1:N, 1:N))
end
@code_warntype test(Val(100))

The output includes a massive list of literal integers:

996900 997000 997100 997200 ... 1000000])
└──      return %9

This surprised me and prompted further investigation, because I had already run into a serious performance issue with similar code in a GPU kernel via KernelAbstractions.jl. Specifically, compiling the kernel resulted in:

588.306677 seconds (13.85 M CPU allocations: 2.513 GiB, 0.05% gc time), 0.00% GPU memmgmt time

It seems that passing a large N as a type parameter (Val{N}) causes excessive unrolling or compile-time expansion of index structures. My questions are:

Is this behavior expected when using Val{N} with large N in combination with LinearIndices?

Is there a recommended way to avoid this kind of compile-time explosion while maintaining type stability?

Are there best practices for using Val-based parameters safely in GPU kernels with KernelAbstractions?

Any insight would be appreciated.

That wall of integers is just the constant-folded LinearIndices value being printed as part of the Core.Const annotation; it is not code that was generated. On v1.11 and below, this displayed form is used because the 2-argument show for LinearIndices is not specialized, so it falls back to the default show for AbstractArrays, which prints every element. A specialized method has been added for the upcoming v1.12, so we display something a bit more meaningful:

julia> @code_warntype test(Val(6))
MethodInstance for test(::Val{6})
  from test(::Val{N}) where N @ Main REPL[1]:1
Static Parameters
  N = 6
Arguments
  #self#::Core.Const(Main.test)
  _::Core.Const(Val{6}())
Body::LinearIndices{3, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}}
1 ─ %1  = Main.LinearIndices::Core.Const(LinearIndices)
│   %2  = Main.:(:)::Core.Const(Colon())
│   %3  = $(Expr(:static_parameter, 1))::Core.Const(6)
│   %4  = (%2)(1, %3)::Core.Const(1:6)
│   %5  = Main.:(:)::Core.Const(Colon())
│   %6  = $(Expr(:static_parameter, 1))::Core.Const(6)
│   %7  = (%5)(1, %6)::Core.Const(1:6)
│   %8  = Main.:(:)::Core.Const(Colon())
│   %9  = $(Expr(:static_parameter, 1))::Core.Const(6)
│   %10 = (%8)(1, %9)::Core.Const(1:6)
│   %11 = Core.tuple(%4, %7, %10)::Core.Const((1:6, 1:6, 1:6))
│   %12 = (%1)(%11)::Core.Const(LinearIndices((1:6, 1:6, 1:6)))
└──       return %12

I doubt this is where your performance issue arises from, since the struct simply contains three UnitRanges.
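
A quick check shows that the object itself is tiny: it only stores the three ranges, and the individual indices are computed on demand when you index or print it (the field name is an implementation detail):

julia> li = LinearIndices((1:100, 1:100, 1:100));

julia> li.indices            # just the three ranges, stored as-is
(1:100, 1:100, 1:100)

julia> sizeof(li)            # three UnitRange{Int64}s at 16 bytes each
48

julia> li[100, 100, 100]     # computed on demand, nothing is materialized
1000000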

Thanks again. Here’s a concise summary of what I observed:

  • Removing LinearIndices from the GPU kernel significantly improved compile-time performance in my app.

    • With LinearIndices(axes):

      588.306677 seconds (13.85 M CPU allocations: 2.513 GiB, 0.05% gc time)
      
    • Replaced with 1:prod(length.(axes)) (see the sketch after this list):

      2.193661 seconds (3.83 M allocations: 198.139 MiB, 72.04% compilation time)
      
  • The slowdown only occurs on the first execution, which strongly suggests a compile-time issue rather than runtime inefficiency.

  • When axes = (1:64, 1:64, 1:64), the first execution is not a problem, but when axes = (1:256, 1:256, 1:256) the first execution is slow.

  • Based on this, I suspect that LinearIndices may be triggering excessive specialization or compile-time computation when used in CUDA kernels, perhaps due to the complex type structure it introduces. (Though the idea that this causes large memory pressure during compilation is just my own speculation.)
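
For concreteness, here is a minimal sketch of the kind of substitution I mean. It is illustrative only (the kernel, array, and sizes are made up, and it uses the CPU backend so it runs without a GPU), not my actual code:

using KernelAbstractions

# Illustrative kernel: write each element's linear index into `out`.
@kernel function write_index!(out, inds)
    i = @index(Global, Linear)
    out[i] = inds[i]
end

axs = (1:64, 1:64, 1:64)
out = zeros(Int, length.(axs))
backend = get_backend(out)    # CPU() for a plain Array

# Original version: pass a LinearIndices over the axes.
write_index!(backend, 64)(out, LinearIndices(axs); ndrange = length(out))

# Replacement: a plain UnitRange covering the same linear indices.
write_index!(backend, 64)(out, 1:prod(length.(axs)); ndrange = length(out))

KernelAbstractions.synchronize(backend)

The two calls are interchangeable here because the kernel only ever uses linear indices; the only thing that changes is the type of the index object passed in.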

In my case, I didn’t strictly need LinearIndices, so switching to a simple linear range solved the issue. However, I’d still appreciate any insight into:

  • Why LinearIndices causes such heavy compile-time behavior in GPU code,
  • And whether there’s a better way to handle multi-dimensional indexing efficiently in kernels.

Maybe this is more related to GPU compilation specifically, so I'll move this topic to a more appropriate category.
Any advice or explanation of the internals would be very helpful.