# Performance issues with CartesianIndices

I am encountering a handful of performance issues using `CartesianIndices` (versus using hard-coded nested `for` loops) to write dimension-agnostic code:

1. There is apparent runtime cost associated with using type parameters in code when it seems there shouldn’t be
2. Iteration of `CartesianIndices` exhibits worse performance compared to hard-coded nested loops
3. Splatting `Tuple(I::CartesianIndices)` in the “wrong place” greatly degrades performance

I am hoping someone might help me better understand and avoid these issues, further explained below, as I am planning to use patterns like those below in some performance critical code. I am mainly concerned with arrays whose `size` in each dimension is small (2-4) and whose `ndims` typically range from 1 to 6.

I am using Julia v1.7.2.

# Setup

The function `fnfill_nd!(A::AbstractArray{K}, f, D::Int)` below populates a `K`-dimensional input array `A` of size `(D, D, ⋯)` using a function `f` on the indices `(i,j,...)::Dims{K}`, `i, j, ⋯ ∈ 1:D`. For example, if `K=2`, `D=3`, and `f(i,j) = (i==j)`, the result of `fnfill_nd!(A, ==, Val(2), 3)` is equivalent to the 3×3 identity matrix. `fnfill_nd` has two methods: one where `D` is passed to `filleye_nd` as a regular variable, and one where `D` is inferred from the size the input `A::StaticArray`.

``````using BenchmarkTools
using StaticArrays
using LinearAlgebra

function fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(I)
end
return A
end

function fnfill_nd!(A::StaticArray{NTuple{K,D}}, f) where {K,D}
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(I)
end
return A
end
``````

Below, I define two simple functions on the indices of two- and four-dimensional arrays, respectively, to which we can apply `fnfill_nd!`.

``````# The kronecker delta, δᵢⱼ
@inline kron2(i::Int, j::Int) = (i==j)
@inline kron2(I::CartesianIndex{2}) = kron2(Tuple(I)...)
# A product of two kronecker deltas: δᵢₖ δⱼₗ
@inline kron4(i,j,k,l) = (i==k) && (j==l)
@inline kron4(I::CartesianIndex{4}) = kron4(Tuple(I)...)
``````

# Issue 1: Apparent runtime cost of `D` as a type parameter

Here are benchmarks of `fnfill_nd!` applied to `kron2` and `kron4`.

``````println("Two-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MMatrix{3,3,Bool}(undef)
println("  Dimension size D specified as regular variable")
result = @btime fnfill_nd!(\$buf, kron2, 3)
@assert result == I[1:3, 1:3]

println("  Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_nd!(\$buf, kron2)
@assert result == I[1:3, 1:3]
end;
println()

println("Four-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MArray{NTuple{4,3}, Bool}(undef)
println("  Dimension size D passed as regular variable")
result = @btime fnfill_nd!(\$buf, kron4, 3)

println("  Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_nd!(\$buf, kron4)
end;
``````
``````    Two-dimensional case, dimension-agnostic iteration over CartesianIndices
Dimension size D specified as regular variable
5.587 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
11.885 ns (0 allocations: 0 bytes)

Four-dimensional case, dimension-agnostic iteration over CartesianIndices
Dimension size D passed as regular variable
107.511 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
100.239 ns (0 allocations: 0 bytes)
``````

The first of the above results (for two dimensions) suggests an cost of using the type parameter `D` versus a simple value. I am wondering if someone could help me understand where that cost is coming from.

Note that the latter result for 4 dimensions is inconsistent as to which method is faster, as the relative difference is small.

# Issue 2: Performance of `CartesianIndices` compared to hard-coded nested loops

Here are functions analogous to `fnfill_nd!`, but with hard-coded loops for two and four dimensions.

``````function fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I)
end
return A
end

function fnfill_2d!(A::StaticArray{NTuple{2,D}}, f) where D
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I)
end
return A
end

function fnfill_4d!(A::AbstractArray, f, D::Int)
for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{4}(i,j,k,l)
A[I] = f(I)
end
return A
end

function fnfill_4d!(A::StaticArray{NTuple{4,D}}, f) where D
for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{4}(i,j,k,l)
A[I] = f(I)
end
return A
end
``````

We can benchmark these and compare to the analogous benchmarks of `fnfill_nd`.

``````println("Two-dimensional case, hard-coded loops")
let buf = MMatrix{3,3,Bool}(undef)
println("  Dimension size D specified as regular variable")
result = @btime fnfill_2d!(\$buf, kron2, 3)
@assert result == I[1:3, 1:3]

println("  Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_2d!(\$buf, kron2)
@assert result == I[1:3, 1:3]
end;
println()

println("Four-dimensional case, hard-coded loops")
let buf = MArray{NTuple{4,3}, Bool}(undef)

println("Dimension size D passed as regular variable")
result = @btime fnfill_4d!(\$buf, kron4, 3)

println("  Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_4d!(\$buf, kron4)

@assert fnfill_4d!(buf, kron4) == fnfill_nd!(buf, kron4)
end;
``````
``````Two-dimensional case, hard-coded loops
Dimension size D specified as regular variable
7.970 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
2.374 ns (0 allocations: 0 bytes)

Four-dimensional case, hard-coded loops
Dimension size D passed as regular variable
68.196 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
24.403 ns (0 allocations: 0 bytes)
``````

Except for the first case, the hard-coded loops are significantly more efficient, and, in contrast to the previous case, using the type parameter `D` is a boon to performance. My (inexpert) guess here is that the compiler is doing some optimization, like unrolling and/or applying some SIMD vectorization to the inner loop(s) because they have a known number of iterations, and, unfortunately, this optimization cannot be done through `CartesianIndices`.

My question is then is there a way to have “elegant” dimension-independent code without sacrificing performance? Right now, it appears that if performance is paramount then one should avoid `CartesianIndices`, and either manually write a separate method for each dimension or use metaprogramming to help automate this task (which I have so far tried to avoid). Alternatively, maybe some statically-sized analogy of `CartesianIndices` would do.

# Issue 3: Using splatting in the wrong place greatly degrades performance

Consider a subtly different version of `fnfill_nd!` that invokes `f(Tuple(I::CartesianIndices)...)` rather than `f(I)`.

``````function splatting_fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(Tuple(I)...)
end
return A
end

let buf = MMatrix{3,3,Bool}(undef)
result = @btime splatting_fnfill_nd!(\$buf, kron2, 3)
@assert result == I[1:3, 1:3]
println()
@code_warntype splatting_fnfill_nd!(buf, kron2, 3)
end;
``````
``````      226.983 ns (0 allocations: 0 bytes)

MethodInstance for splatting_fnfill_nd!(::MMatrix{3, 3, Bool, 9}, ::typeof(kron2), ::Int64)
from splatting_fnfill_nd!(A::AbstractArray{<:Any, K}, f, D::Int64) where K in Main at In:1
Static Parameters
K = 2
Arguments
#self#::Core.Const(splatting_fnfill_nd!)
A::MMatrix{3, 3, Bool, 9}
f::Core.Const(kron2)
D::Int64
Locals
@_5::Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}
#17::var"#17#18"{Int64}
I::CartesianIndex{2}
Body::MMatrix{3, 3, Bool, 9}
1 ─ %1  = Main.:(var"#17#18")::Core.Const(var"#17#18")
│   %2  = Core.typeof(D)::Core.Const(Int64)
│   %3  = Core.apply_type(%1, %2)::Core.Const(var"#17#18"{Int64})
│         (#17 = %new(%3, D))
│   %5  = #17::var"#17#18"{Int64}
│   %6  = Main.Val(\$(Expr(:static_parameter, 1)))::Core.Const(Val{2}())
│   %7  = Main.ntuple(%5, %6)::Tuple{Int64, Int64}
│   %8  = Main.CartesianIndices(%7)::CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}
│         (@_5 = Base.iterate(%8))
│   %10 = (@_5 === nothing)::Bool
│   %11 = Base.not_int(%10)::Bool
└──       goto #4 if not %11
2 ┄ %13 = @_5::Tuple{CartesianIndex{2}, CartesianIndex{2}}
│         (I = Core.getfield(%13, 1))
│   %15 = Core.getfield(%13, 2)::CartesianIndex{2}
│   %16 = Main.Tuple(I)::Tuple{Int64, Int64}
│   %17 = Core._apply_iterate(Base.iterate, f, %16)::Bool
│         Base.setindex!(A, %17, I)
│         (@_5 = Base.iterate(%8, %15))
│   %20 = (@_5 === nothing)::Bool
│   %21 = Base.not_int(%20)::Bool
└──       goto #4 if not %21
3 ─       goto #2
4 ┄       return A
``````

Ouch! `splatting_fnfill_nd!` is approximately 40 times slower than `fnfill_nd!`. What happened? `@btime` reports no allocations and `@code_warntype` suggests that `splatting_fnfill_nd!` is type stable, at least as far as it should be. The appearance of a local variable of type `Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}`, AFAIK, is a result of calling `iterate(::CartesianIndices)` and is not usually a performance concern. Basically, the only difference here is where I chose to do the splatting, which suggests that the method of `kron2` or `kron4` which takes a `CartesianIndex` and then splats acts as a function barrier. However, I did not think that a function barrier would be necessary here as long as the splatted `Tuple` is of a fully inferred type.

The problem also occurs when using explicit looping.

``````function splatting_fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(Tuple(I)...)
end
return A
end

let buf = MMatrix{3,3,Bool}(undef)
result = @btime splatting_fnfill_2d!(\$buf, kron2, 3)
@assert result == I[1:3, 1:3]
end;
``````
``````238.840 ns (0 allocations: 0 bytes)
``````

However, performance is recovered if one avoids splatting in the loop.

``````    function nosplatting_fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I, I)
end
return A
end

let buf = MMatrix{3,3,Bool}(undef)
result = @btime nosplatting_fnfill_2d!(\$buf, kron2, 3)
@assert result == I[1:3, 1:3]
end;
``````
``````6.572 ns (0 allocations: 0 bytes)
``````

It seems like the moral here is to avoid splatting in loops even when using type stable functions, but I do not understand why.

Thanks!
Edit: neatness.

1 Like

I’m sorry I haven’t read your entire post, but why are you supplying the size of the input array as an input parameter? Why not just

``````function foo!(A, f)
for i in CartesianIndices(A)
A[i] = f(i)
end
return A
end
``````

?

That should also specialize on statically sized arrays.

It’s often wise to explicitly specialize on any inner-loop function passed as an argument, and this resolves issue 3:

``````function splatting_fnfill_nd!(A::AbstractArray{<:Any,K}, f::F, D::Int) where {K, F}
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(Tuple(I)...)
end
return A
end
``````
``````julia> let buf = MMatrix{3,3,Bool}(undef)
result = @btime splatting_fnfill_nd!(\$buf, kron2, 3)
@assert result == I[1:3, 1:3]
println()
@code_warntype splatting_fnfill_nd!(buf, kron2, 3)
end;
4.900 ns (0 allocations: 0 bytes)
``````
1 Like

I’m not really understanding why this would be faster than `CartesianIndices(buf)`, which infers as

``````CartesianIndices{2, Tuple{SOneTo{3}, SOneTo{3}}}
``````

Benchmarking indicates that it is faster, but why?

1 Like

I scanned the docs of `CartesianIndices` too quickly and failed to realize one could call CartesianIndices on an array and be done with it. Calling `CartesianIndices(A)` definitely seems like better style.

I am also puzzled as to why the performance is not as good as directly supplying the dimensions with `ntuple`, especially because the `CartesianIndices` object then loops over `SOneTo{3}` rather than `OneTo{Int64}`.

Thanks! That really helps. So, before, performance was dependent on Julia’s heuristics for when to specialize automatically. I saw a similar post where that was the issue, but the example was a little different than mine and I must have not quite understood.

Edit: …and here’s the relevant performance tip in the manual, in case anyone else runs into this.