I am encountering a handful of performance issues using `CartesianIndices`

(versus using hard-coded nested `for`

loops) to write dimension-agnostic code:

- There is apparent runtime cost associated with using type parameters in code when it seems there shouldn’t be
- Iteration of
`CartesianIndices`

exhibits worse performance compared to hard-coded nested loops - Splatting
`Tuple(I::CartesianIndices)`

in the “wrong place” greatly degrades performance

I am hoping someone might help me better understand and avoid these issues, further explained below, as I am planning to use patterns like those below in some performance critical code. I am mainly concerned with arrays whose `size`

in each dimension is small (2-4) and whose `ndims`

typically range from 1 to 6.

I am using Julia v1.7.2.

# Setup

The function `fnfill_nd!(A::AbstractArray{K}, f, D::Int)`

below populates a `K`

-dimensional input array `A`

of size `(D, D, ⋯)`

using a function `f`

on the indices `(i,j,...)::Dims{K}`

, `i, j, ⋯ ∈ 1:D`

. For example, if `K=2`

, `D=3`

, and `f(i,j) = (i==j)`

, the result of `fnfill_nd!(A, ==, Val(2), 3)`

is equivalent to the 3×3 identity matrix. `fnfill_nd`

has two methods: one where `D`

is passed to `filleye_nd`

as a regular variable, and one where `D`

is inferred from the size the input `A::StaticArray`

.

```
using BenchmarkTools
using StaticArrays
using LinearAlgebra
function fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(I)
end
return A
end
function fnfill_nd!(A::StaticArray{NTuple{K,D}}, f) where {K,D}
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(I)
end
return A
end
```

Below, I define two simple functions on the indices of two- and four-dimensional arrays, respectively, to which we can apply `fnfill_nd!`

.

```
# The kronecker delta, δᵢⱼ
@inline kron2(i::Int, j::Int) = (i==j)
@inline kron2(I::CartesianIndex{2}) = kron2(Tuple(I)...)
# A product of two kronecker deltas: δᵢₖ δⱼₗ
@inline kron4(i,j,k,l) = (i==k) && (j==l)
@inline kron4(I::CartesianIndex{4}) = kron4(Tuple(I)...)
```

#
Issue 1: Apparent runtime cost of `D`

as a type parameter

Here are benchmarks of `fnfill_nd!`

applied to `kron2`

and `kron4`

.

```
println("Two-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MMatrix{3,3,Bool}(undef)
println(" Dimension size D specified as regular variable")
result = @btime fnfill_nd!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_nd!($buf, kron2)
@assert result == I[1:3, 1:3]
end;
println()
println("Four-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MArray{NTuple{4,3}, Bool}(undef)
println(" Dimension size D passed as regular variable")
result = @btime fnfill_nd!($buf, kron4, 3)
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_nd!($buf, kron4)
end;
```

```
Two-dimensional case, dimension-agnostic iteration over CartesianIndices
Dimension size D specified as regular variable
5.587 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
11.885 ns (0 allocations: 0 bytes)
Four-dimensional case, dimension-agnostic iteration over CartesianIndices
Dimension size D passed as regular variable
107.511 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
100.239 ns (0 allocations: 0 bytes)
```

The first of the above results (for two dimensions) suggests an cost of using the type parameter `D`

versus a simple value. I am wondering if someone could help me understand where that cost is coming from.

Note that the latter result for 4 dimensions is inconsistent as to which method is faster, as the relative difference is small.

#
Issue 2: Performance of `CartesianIndices`

compared to hard-coded nested loops

Here are functions analogous to `fnfill_nd!`

, but with hard-coded loops for two and four dimensions.

```
function fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I)
end
return A
end
function fnfill_2d!(A::StaticArray{NTuple{2,D}}, f) where D
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I)
end
return A
end
function fnfill_4d!(A::AbstractArray, f, D::Int)
for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{4}(i,j,k,l)
A[I] = f(I)
end
return A
end
function fnfill_4d!(A::StaticArray{NTuple{4,D}}, f) where D
for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{4}(i,j,k,l)
A[I] = f(I)
end
return A
end
```

We can benchmark these and compare to the analogous benchmarks of `fnfill_nd`

.

```
println("Two-dimensional case, hard-coded loops")
let buf = MMatrix{3,3,Bool}(undef)
println(" Dimension size D specified as regular variable")
result = @btime fnfill_2d!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_2d!($buf, kron2)
@assert result == I[1:3, 1:3]
end;
println()
println("Four-dimensional case, hard-coded loops")
let buf = MArray{NTuple{4,3}, Bool}(undef)
println("Dimension size D passed as regular variable")
result = @btime fnfill_4d!($buf, kron4, 3)
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_4d!($buf, kron4)
@assert fnfill_4d!(buf, kron4) == fnfill_nd!(buf, kron4)
end;
```

```
Two-dimensional case, hard-coded loops
Dimension size D specified as regular variable
7.970 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
2.374 ns (0 allocations: 0 bytes)
Four-dimensional case, hard-coded loops
Dimension size D passed as regular variable
68.196 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
24.403 ns (0 allocations: 0 bytes)
```

Except for the first case, the hard-coded loops are significantly more efficient, and, in contrast to the previous case, using the type parameter `D`

is a boon to performance. My (inexpert) guess here is that the compiler is doing some optimization, like unrolling and/or applying some SIMD vectorization to the inner loop(s) because they have a known number of iterations, and, unfortunately, this optimization cannot be done through `CartesianIndices`

.

My question is then is there a way to have “elegant” dimension-independent code without sacrificing performance? Right now, it appears that if performance is paramount then one should avoid `CartesianIndices`

, and either manually write a separate method for each dimension or use metaprogramming to help automate this task (which I have so far tried to avoid). Alternatively, maybe some statically-sized analogy of `CartesianIndices`

would do.

# Issue 3: Using splatting in the wrong place greatly degrades performance

Consider a subtly different version of `fnfill_nd!`

that invokes `f(Tuple(I::CartesianIndices)...)`

rather than `f(I)`

.

```
function splatting_fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(Tuple(I)...)
end
return A
end
let buf = MMatrix{3,3,Bool}(undef)
result = @btime splatting_fnfill_nd!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
println()
@code_warntype splatting_fnfill_nd!(buf, kron2, 3)
end;
```

```
226.983 ns (0 allocations: 0 bytes)
MethodInstance for splatting_fnfill_nd!(::MMatrix{3, 3, Bool, 9}, ::typeof(kron2), ::Int64)
from splatting_fnfill_nd!(A::AbstractArray{<:Any, K}, f, D::Int64) where K in Main at In[20]:1
Static Parameters
K = 2
Arguments
#self#::Core.Const(splatting_fnfill_nd!)
A::MMatrix{3, 3, Bool, 9}
f::Core.Const(kron2)
D::Int64
Locals
@_5::Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}
#17::var"#17#18"{Int64}
I::CartesianIndex{2}
Body::MMatrix{3, 3, Bool, 9}
1 ─ %1 = Main.:(var"#17#18")::Core.Const(var"#17#18")
│ %2 = Core.typeof(D)::Core.Const(Int64)
│ %3 = Core.apply_type(%1, %2)::Core.Const(var"#17#18"{Int64})
│ (#17 = %new(%3, D))
│ %5 = #17::var"#17#18"{Int64}
│ %6 = Main.Val($(Expr(:static_parameter, 1)))::Core.Const(Val{2}())
│ %7 = Main.ntuple(%5, %6)::Tuple{Int64, Int64}
│ %8 = Main.CartesianIndices(%7)::CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}
│ (@_5 = Base.iterate(%8))
│ %10 = (@_5 === nothing)::Bool
│ %11 = Base.not_int(%10)::Bool
└── goto #4 if not %11
2 ┄ %13 = @_5::Tuple{CartesianIndex{2}, CartesianIndex{2}}
│ (I = Core.getfield(%13, 1))
│ %15 = Core.getfield(%13, 2)::CartesianIndex{2}
│ %16 = Main.Tuple(I)::Tuple{Int64, Int64}
│ %17 = Core._apply_iterate(Base.iterate, f, %16)::Bool
│ Base.setindex!(A, %17, I)
│ (@_5 = Base.iterate(%8, %15))
│ %20 = (@_5 === nothing)::Bool
│ %21 = Base.not_int(%20)::Bool
└── goto #4 if not %21
3 ─ goto #2
4 ┄ return A
```

Ouch! `splatting_fnfill_nd!`

is approximately 40 times slower than `fnfill_nd!`

. What happened? `@btime`

reports no allocations and `@code_warntype`

suggests that `splatting_fnfill_nd!`

is type stable, at least as far as it should be. The appearance of a local variable of type `Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}`

, AFAIK, is a result of calling `iterate(::CartesianIndices)`

and is not usually a performance concern. Basically, the only difference here is where I chose to do the splatting, which suggests that the method of `kron2`

or `kron4`

which takes a `CartesianIndex`

and then splats acts as a function barrier. However, I did not think that a function barrier would be necessary here as long as the splatted `Tuple`

is of a fully inferred type.

The problem also occurs when using explicit looping.

```
function splatting_fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(Tuple(I)...)
end
return A
end
let buf = MMatrix{3,3,Bool}(undef)
result = @btime splatting_fnfill_2d!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
end;
```

```
238.840 ns (0 allocations: 0 bytes)
```

However, performance is recovered if one avoids splatting in the loop.

```
function nosplatting_fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I[1], I[2])
end
return A
end
let buf = MMatrix{3,3,Bool}(undef)
result = @btime nosplatting_fnfill_2d!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
end;
```

```
6.572 ns (0 allocations: 0 bytes)
```

It seems like the moral here is to avoid splatting in loops even when using type stable functions, but I do not understand why.

**Thanks!**

Edit: neatness.