I am encountering a handful of performance issues using CartesianIndices (versus using hard-coded nested for loops) to write dimension-agnostic code:
- There is apparent runtime cost associated with using type parameters in code when it seems there shouldn’t be
- Iteration of
CartesianIndicesexhibits worse performance compared to hard-coded nested loops - Splatting
Tuple(I::CartesianIndices)in the “wrong place” greatly degrades performance
I am hoping someone might help me better understand and avoid these issues, further explained below, as I am planning to use patterns like those below in some performance critical code. I am mainly concerned with arrays whose size in each dimension is small (2-4) and whose ndims typically range from 1 to 6.
I am using Julia v1.7.2.
Setup
The function fnfill_nd!(A::AbstractArray{K}, f, D::Int) below populates a K-dimensional input array A of size (D, D, ⋯) using a function f on the indices (i,j,...)::Dims{K}, i, j, ⋯ ∈ 1:D. For example, if K=2, D=3, and f(i,j) = (i==j), the result of fnfill_nd!(A, ==, Val(2), 3) is equivalent to the 3×3 identity matrix. fnfill_nd has two methods: one where D is passed to filleye_nd as a regular variable, and one where D is inferred from the size the input A::StaticArray.
using BenchmarkTools
using StaticArrays
using LinearAlgebra
function fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(I)
end
return A
end
function fnfill_nd!(A::StaticArray{NTuple{K,D}}, f) where {K,D}
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(I)
end
return A
end
Below, I define two simple functions on the indices of two- and four-dimensional arrays, respectively, to which we can apply fnfill_nd!.
# The kronecker delta, δᵢⱼ
@inline kron2(i::Int, j::Int) = (i==j)
@inline kron2(I::CartesianIndex{2}) = kron2(Tuple(I)...)
# A product of two kronecker deltas: δᵢₖ δⱼₗ
@inline kron4(i,j,k,l) = (i==k) && (j==l)
@inline kron4(I::CartesianIndex{4}) = kron4(Tuple(I)...)
Issue 1: Apparent runtime cost of D as a type parameter
Here are benchmarks of fnfill_nd! applied to kron2 and kron4.
println("Two-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MMatrix{3,3,Bool}(undef)
println(" Dimension size D specified as regular variable")
result = @btime fnfill_nd!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_nd!($buf, kron2)
@assert result == I[1:3, 1:3]
end;
println()
println("Four-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MArray{NTuple{4,3}, Bool}(undef)
println(" Dimension size D passed as regular variable")
result = @btime fnfill_nd!($buf, kron4, 3)
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_nd!($buf, kron4)
end;
Two-dimensional case, dimension-agnostic iteration over CartesianIndices
Dimension size D specified as regular variable
5.587 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
11.885 ns (0 allocations: 0 bytes)
Four-dimensional case, dimension-agnostic iteration over CartesianIndices
Dimension size D passed as regular variable
107.511 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
100.239 ns (0 allocations: 0 bytes)
The first of the above results (for two dimensions) suggests an cost of using the type parameter D versus a simple value. I am wondering if someone could help me understand where that cost is coming from.
Note that the latter result for 4 dimensions is inconsistent as to which method is faster, as the relative difference is small.
Issue 2: Performance of CartesianIndices compared to hard-coded nested loops
Here are functions analogous to fnfill_nd!, but with hard-coded loops for two and four dimensions.
function fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I)
end
return A
end
function fnfill_2d!(A::StaticArray{NTuple{2,D}}, f) where D
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I)
end
return A
end
function fnfill_4d!(A::AbstractArray, f, D::Int)
for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{4}(i,j,k,l)
A[I] = f(I)
end
return A
end
function fnfill_4d!(A::StaticArray{NTuple{4,D}}, f) where D
for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{4}(i,j,k,l)
A[I] = f(I)
end
return A
end
We can benchmark these and compare to the analogous benchmarks of fnfill_nd.
println("Two-dimensional case, hard-coded loops")
let buf = MMatrix{3,3,Bool}(undef)
println(" Dimension size D specified as regular variable")
result = @btime fnfill_2d!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_2d!($buf, kron2)
@assert result == I[1:3, 1:3]
end;
println()
println("Four-dimensional case, hard-coded loops")
let buf = MArray{NTuple{4,3}, Bool}(undef)
println("Dimension size D passed as regular variable")
result = @btime fnfill_4d!($buf, kron4, 3)
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_4d!($buf, kron4)
@assert fnfill_4d!(buf, kron4) == fnfill_nd!(buf, kron4)
end;
Two-dimensional case, hard-coded loops
Dimension size D specified as regular variable
7.970 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
2.374 ns (0 allocations: 0 bytes)
Four-dimensional case, hard-coded loops
Dimension size D passed as regular variable
68.196 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
24.403 ns (0 allocations: 0 bytes)
Except for the first case, the hard-coded loops are significantly more efficient, and, in contrast to the previous case, using the type parameter D is a boon to performance. My (inexpert) guess here is that the compiler is doing some optimization, like unrolling and/or applying some SIMD vectorization to the inner loop(s) because they have a known number of iterations, and, unfortunately, this optimization cannot be done through CartesianIndices.
My question is then is there a way to have “elegant” dimension-independent code without sacrificing performance? Right now, it appears that if performance is paramount then one should avoid CartesianIndices, and either manually write a separate method for each dimension or use metaprogramming to help automate this task (which I have so far tried to avoid). Alternatively, maybe some statically-sized analogy of CartesianIndices would do.
Issue 3: Using splatting in the wrong place greatly degrades performance
Consider a subtly different version of fnfill_nd! that invokes f(Tuple(I::CartesianIndices)...) rather than f(I).
function splatting_fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(Tuple(I)...)
end
return A
end
let buf = MMatrix{3,3,Bool}(undef)
result = @btime splatting_fnfill_nd!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
println()
@code_warntype splatting_fnfill_nd!(buf, kron2, 3)
end;
226.983 ns (0 allocations: 0 bytes)
MethodInstance for splatting_fnfill_nd!(::MMatrix{3, 3, Bool, 9}, ::typeof(kron2), ::Int64)
from splatting_fnfill_nd!(A::AbstractArray{<:Any, K}, f, D::Int64) where K in Main at In[20]:1
Static Parameters
K = 2
Arguments
#self#::Core.Const(splatting_fnfill_nd!)
A::MMatrix{3, 3, Bool, 9}
f::Core.Const(kron2)
D::Int64
Locals
@_5::Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}
#17::var"#17#18"{Int64}
I::CartesianIndex{2}
Body::MMatrix{3, 3, Bool, 9}
1 ─ %1 = Main.:(var"#17#18")::Core.Const(var"#17#18")
│ %2 = Core.typeof(D)::Core.Const(Int64)
│ %3 = Core.apply_type(%1, %2)::Core.Const(var"#17#18"{Int64})
│ (#17 = %new(%3, D))
│ %5 = #17::var"#17#18"{Int64}
│ %6 = Main.Val($(Expr(:static_parameter, 1)))::Core.Const(Val{2}())
│ %7 = Main.ntuple(%5, %6)::Tuple{Int64, Int64}
│ %8 = Main.CartesianIndices(%7)::CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}
│ (@_5 = Base.iterate(%8))
│ %10 = (@_5 === nothing)::Bool
│ %11 = Base.not_int(%10)::Bool
└── goto #4 if not %11
2 ┄ %13 = @_5::Tuple{CartesianIndex{2}, CartesianIndex{2}}
│ (I = Core.getfield(%13, 1))
│ %15 = Core.getfield(%13, 2)::CartesianIndex{2}
│ %16 = Main.Tuple(I)::Tuple{Int64, Int64}
│ %17 = Core._apply_iterate(Base.iterate, f, %16)::Bool
│ Base.setindex!(A, %17, I)
│ (@_5 = Base.iterate(%8, %15))
│ %20 = (@_5 === nothing)::Bool
│ %21 = Base.not_int(%20)::Bool
└── goto #4 if not %21
3 ─ goto #2
4 ┄ return A
Ouch! splatting_fnfill_nd! is approximately 40 times slower than fnfill_nd!. What happened? @btime reports no allocations and @code_warntype suggests that splatting_fnfill_nd! is type stable, at least as far as it should be. The appearance of a local variable of type Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}, AFAIK, is a result of calling iterate(::CartesianIndices) and is not usually a performance concern. Basically, the only difference here is where I chose to do the splatting, which suggests that the method of kron2 or kron4 which takes a CartesianIndex and then splats acts as a function barrier. However, I did not think that a function barrier would be necessary here as long as the splatted Tuple is of a fully inferred type.
The problem also occurs when using explicit looping.
function splatting_fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(Tuple(I)...)
end
return A
end
let buf = MMatrix{3,3,Bool}(undef)
result = @btime splatting_fnfill_2d!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
end;
238.840 ns (0 allocations: 0 bytes)
However, performance is recovered if one avoids splatting in the loop.
function nosplatting_fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I[1], I[2])
end
return A
end
let buf = MMatrix{3,3,Bool}(undef)
result = @btime nosplatting_fnfill_2d!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
end;
6.572 ns (0 allocations: 0 bytes)
It seems like the moral here is to avoid splatting in loops even when using type stable functions, but I do not understand why.
Thanks!
Edit: neatness.