Performance issues with CartesianIndices

I am encountering a handful of performance issues using CartesianIndices (versus using hard-coded nested for loops) to write dimension-agnostic code:

  1. There is apparent runtime cost associated with using type parameters in code when it seems there shouldn’t be
  2. Iteration of CartesianIndices exhibits worse performance compared to hard-coded nested loops
  3. Splatting Tuple(I::CartesianIndices) in the “wrong place” greatly degrades performance

I am hoping someone might help me better understand and avoid these issues, further explained below, as I am planning to use patterns like those below in some performance critical code. I am mainly concerned with arrays whose size in each dimension is small (2-4) and whose ndims typically range from 1 to 6.

I am using Julia v1.7.2.

Setup

The function fnfill_nd!(A::AbstractArray{K}, f, D::Int) below populates a K-dimensional input array A of size (D, D, ⋯) using a function f on the indices (i,j,...)::Dims{K}, i, j, ⋯ ∈ 1:D. For example, if K=2, D=3, and f(i,j) = (i==j), the result of fnfill_nd!(A, ==, Val(2), 3) is equivalent to the 3×3 identity matrix. fnfill_nd has two methods: one where D is passed to filleye_nd as a regular variable, and one where D is inferred from the size the input A::StaticArray.

using BenchmarkTools
using StaticArrays
using LinearAlgebra

function fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
    for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
        A[I] = f(I)
    end
    return A
end

function fnfill_nd!(A::StaticArray{NTuple{K,D}}, f) where {K,D}
    for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
        A[I] = f(I)
    end
    return A
end

Below, I define two simple functions on the indices of two- and four-dimensional arrays, respectively, to which we can apply fnfill_nd!.

# The kronecker delta, δᵢⱼ
@inline kron2(i::Int, j::Int) = (i==j)
@inline kron2(I::CartesianIndex{2}) = kron2(Tuple(I)...)
# A product of two kronecker deltas: δᵢₖ δⱼₗ
@inline kron4(i,j,k,l) = (i==k) && (j==l)
@inline kron4(I::CartesianIndex{4}) = kron4(Tuple(I)...)

Issue 1: Apparent runtime cost of D as a type parameter

Here are benchmarks of fnfill_nd! applied to kron2 and kron4.

println("Two-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MMatrix{3,3,Bool}(undef)
    println("  Dimension size D specified as regular variable")
    result = @btime fnfill_nd!($buf, kron2, 3)
    @assert result == I[1:3, 1:3]

    println("  Dimension size D infered as a static parameter of `buf`")
    result = @btime fnfill_nd!($buf, kron2)
    @assert result == I[1:3, 1:3]
end;
println()

println("Four-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MArray{NTuple{4,3}, Bool}(undef)
    println("  Dimension size D passed as regular variable")
    result = @btime fnfill_nd!($buf, kron4, 3)

    println("  Dimension size D infered as a static parameter of `buf`")
    result = @btime fnfill_nd!($buf, kron4)
end;
    Two-dimensional case, dimension-agnostic iteration over CartesianIndices
      Dimension size D specified as regular variable
      5.587 ns (0 allocations: 0 bytes)
      Dimension size D infered as a static parameter of `buf`
      11.885 ns (0 allocations: 0 bytes)
    
    Four-dimensional case, dimension-agnostic iteration over CartesianIndices
      Dimension size D passed as regular variable
      107.511 ns (0 allocations: 0 bytes)
      Dimension size D infered as a static parameter of `buf`
      100.239 ns (0 allocations: 0 bytes)

The first of the above results (for two dimensions) suggests an cost of using the type parameter D versus a simple value. I am wondering if someone could help me understand where that cost is coming from.

Note that the latter result for 4 dimensions is inconsistent as to which method is faster, as the relative difference is small.

Issue 2: Performance of CartesianIndices compared to hard-coded nested loops

Here are functions analogous to fnfill_nd!, but with hard-coded loops for two and four dimensions.

function fnfill_2d!(A::AbstractArray, f, D::Int)
    for j ∈ 1:D, i ∈ 1:D
        I = CartesianIndex{2}(i,j)
        A[I] = f(I)
    end
    return A
end

function fnfill_2d!(A::StaticArray{NTuple{2,D}}, f) where D
    for j ∈ 1:D, i ∈ 1:D
        I = CartesianIndex{2}(i,j)
        A[I] = f(I)
    end
    return A
end

function fnfill_4d!(A::AbstractArray, f, D::Int)
    for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
        I = CartesianIndex{4}(i,j,k,l)
        A[I] = f(I)
    end
    return A
end

function fnfill_4d!(A::StaticArray{NTuple{4,D}}, f) where D
    for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
        I = CartesianIndex{4}(i,j,k,l)
        A[I] = f(I)
    end
    return A
end

We can benchmark these and compare to the analogous benchmarks of fnfill_nd.

println("Two-dimensional case, hard-coded loops")
let buf = MMatrix{3,3,Bool}(undef)
    println("  Dimension size D specified as regular variable")
    result = @btime fnfill_2d!($buf, kron2, 3)
    @assert result == I[1:3, 1:3]

    println("  Dimension size D infered as a static parameter of `buf`")
    result = @btime fnfill_2d!($buf, kron2)
    @assert result == I[1:3, 1:3]
end;
println()

println("Four-dimensional case, hard-coded loops")
let buf = MArray{NTuple{4,3}, Bool}(undef)

    println("Dimension size D passed as regular variable")
    result = @btime fnfill_4d!($buf, kron4, 3)

    println("  Dimension size D infered as a static parameter of `buf`")
    result = @btime fnfill_4d!($buf, kron4)

    @assert fnfill_4d!(buf, kron4) == fnfill_nd!(buf, kron4)
end;
Two-dimensional case, hard-coded loops
  Dimension size D specified as regular variable
  7.970 ns (0 allocations: 0 bytes)
  Dimension size D infered as a static parameter of `buf`
  2.374 ns (0 allocations: 0 bytes)

Four-dimensional case, hard-coded loops
Dimension size D passed as regular variable
  68.196 ns (0 allocations: 0 bytes)
  Dimension size D infered as a static parameter of `buf`
  24.403 ns (0 allocations: 0 bytes)

Except for the first case, the hard-coded loops are significantly more efficient, and, in contrast to the previous case, using the type parameter D is a boon to performance. My (inexpert) guess here is that the compiler is doing some optimization, like unrolling and/or applying some SIMD vectorization to the inner loop(s) because they have a known number of iterations, and, unfortunately, this optimization cannot be done through CartesianIndices.

My question is then is there a way to have “elegant” dimension-independent code without sacrificing performance? Right now, it appears that if performance is paramount then one should avoid CartesianIndices, and either manually write a separate method for each dimension or use metaprogramming to help automate this task (which I have so far tried to avoid). Alternatively, maybe some statically-sized analogy of CartesianIndices would do.

Issue 3: Using splatting in the wrong place greatly degrades performance

Consider a subtly different version of fnfill_nd! that invokes f(Tuple(I::CartesianIndices)...) rather than f(I).

function splatting_fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
    for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
        A[I] = f(Tuple(I)...)
    end
    return A
end

let buf = MMatrix{3,3,Bool}(undef)
    result = @btime splatting_fnfill_nd!($buf, kron2, 3)
    @assert result == I[1:3, 1:3]
    println()
    @code_warntype splatting_fnfill_nd!(buf, kron2, 3)
end;
      226.983 ns (0 allocations: 0 bytes)
    
    MethodInstance for splatting_fnfill_nd!(::MMatrix{3, 3, Bool, 9}, ::typeof(kron2), ::Int64)
      from splatting_fnfill_nd!(A::AbstractArray{<:Any, K}, f, D::Int64) where K in Main at In[20]:1
    Static Parameters
      K = 2
    Arguments
      #self#::Core.Const(splatting_fnfill_nd!)
      A::MMatrix{3, 3, Bool, 9}
      f::Core.Const(kron2)
      D::Int64
    Locals
      @_5::Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}
      #17::var"#17#18"{Int64}
      I::CartesianIndex{2}
    Body::MMatrix{3, 3, Bool, 9}
    1 ─ %1  = Main.:(var"#17#18")::Core.Const(var"#17#18")
    │   %2  = Core.typeof(D)::Core.Const(Int64)
    │   %3  = Core.apply_type(%1, %2)::Core.Const(var"#17#18"{Int64})
    │         (#17 = %new(%3, D))
    │   %5  = #17::var"#17#18"{Int64}
    │   %6  = Main.Val($(Expr(:static_parameter, 1)))::Core.Const(Val{2}())
    │   %7  = Main.ntuple(%5, %6)::Tuple{Int64, Int64}
    │   %8  = Main.CartesianIndices(%7)::CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}
    │         (@_5 = Base.iterate(%8))
    │   %10 = (@_5 === nothing)::Bool
    │   %11 = Base.not_int(%10)::Bool
    └──       goto #4 if not %11
    2 ┄ %13 = @_5::Tuple{CartesianIndex{2}, CartesianIndex{2}}
    │         (I = Core.getfield(%13, 1))
    │   %15 = Core.getfield(%13, 2)::CartesianIndex{2}
    │   %16 = Main.Tuple(I)::Tuple{Int64, Int64}
    │   %17 = Core._apply_iterate(Base.iterate, f, %16)::Bool
    │         Base.setindex!(A, %17, I)
    │         (@_5 = Base.iterate(%8, %15))
    │   %20 = (@_5 === nothing)::Bool
    │   %21 = Base.not_int(%20)::Bool
    └──       goto #4 if not %21
    3 ─       goto #2
    4 ┄       return A

Ouch! splatting_fnfill_nd! is approximately 40 times slower than fnfill_nd!. What happened? @btime reports no allocations and @code_warntype suggests that splatting_fnfill_nd! is type stable, at least as far as it should be. The appearance of a local variable of type Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}, AFAIK, is a result of calling iterate(::CartesianIndices) and is not usually a performance concern. Basically, the only difference here is where I chose to do the splatting, which suggests that the method of kron2 or kron4 which takes a CartesianIndex and then splats acts as a function barrier. However, I did not think that a function barrier would be necessary here as long as the splatted Tuple is of a fully inferred type.

The problem also occurs when using explicit looping.

function splatting_fnfill_2d!(A::AbstractArray, f, D::Int)
    for j ∈ 1:D, i ∈ 1:D
        I = CartesianIndex{2}(i,j)
        A[I] = f(Tuple(I)...)
    end
    return A
end

let buf = MMatrix{3,3,Bool}(undef)
    result = @btime splatting_fnfill_2d!($buf, kron2, 3)
    @assert result == I[1:3, 1:3]
end;
238.840 ns (0 allocations: 0 bytes)

However, performance is recovered if one avoids splatting in the loop.

    function nosplatting_fnfill_2d!(A::AbstractArray, f, D::Int)
        for j ∈ 1:D, i ∈ 1:D
            I = CartesianIndex{2}(i,j)
            A[I] = f(I[1], I[2])
        end
        return A
    end
    
    let buf = MMatrix{3,3,Bool}(undef)
        result = @btime nosplatting_fnfill_2d!($buf, kron2, 3)
        @assert result == I[1:3, 1:3]
    end;
6.572 ns (0 allocations: 0 bytes)

It seems like the moral here is to avoid splatting in loops even when using type stable functions, but I do not understand why.

Thanks!
Edit: neatness.

1 Like

I’m sorry I haven’t read your entire post, but why are you supplying the size of the input array as an input parameter? Why not just

function foo!(A, f)
    for i in CartesianIndices(A)
        A[i] = f(i)
    end
    return A
end

?

That should also specialize on statically sized arrays.

It’s often wise to explicitly specialize on any inner-loop function passed as an argument, and this resolves issue 3:

function splatting_fnfill_nd!(A::AbstractArray{<:Any,K}, f::F, D::Int) where {K, F}
    for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
        A[I] = f(Tuple(I)...)
    end
    return A
end
julia> let buf = MMatrix{3,3,Bool}(undef)
           result = @btime splatting_fnfill_nd!($buf, kron2, 3)
           @assert result == I[1:3, 1:3]
           println()
           @code_warntype splatting_fnfill_nd!(buf, kron2, 3)
       end;
  4.900 ns (0 allocations: 0 bytes)
1 Like

I’m not really understanding why this would be faster than CartesianIndices(buf), which infers as

CartesianIndices{2, Tuple{SOneTo{3}, SOneTo{3}}}

Benchmarking indicates that it is faster, but why?

1 Like

I scanned the docs of CartesianIndices too quickly and failed to realize one could call CartesianIndices on an array and be done with it. Calling CartesianIndices(A) definitely seems like better style.

I am also puzzled as to why the performance is not as good as directly supplying the dimensions with ntuple, especially because the CartesianIndices object then loops over SOneTo{3} rather than OneTo{Int64}.

Thanks! That really helps. So, before, performance was dependent on Julia’s heuristics for when to specialize automatically. I saw a similar post where that was the issue, but the example was a little different than mine and I must have not quite understood.

Edit: …and here’s the relevant performance tip in the manual, in case anyone else runs into this.