I am encountering a handful of performance issues using CartesianIndices
(versus using hard-coded nested for
loops) to write dimension-agnostic code:
- There is apparent runtime cost associated with using type parameters in code when it seems there shouldn’t be
- Iteration of
CartesianIndices
exhibits worse performance compared to hard-coded nested loops - Splatting
Tuple(I::CartesianIndices)
in the “wrong place” greatly degrades performance
I am hoping someone might help me better understand and avoid these issues, further explained below, as I am planning to use patterns like those below in some performance critical code. I am mainly concerned with arrays whose size
in each dimension is small (2-4) and whose ndims
typically range from 1 to 6.
I am using Julia v1.7.2.
Setup
The function fnfill_nd!(A::AbstractArray{K}, f, D::Int)
below populates a K
-dimensional input array A
of size (D, D, ⋯)
using a function f
on the indices (i,j,...)::Dims{K}
, i, j, ⋯ ∈ 1:D
. For example, if K=2
, D=3
, and f(i,j) = (i==j)
, the result of fnfill_nd!(A, ==, Val(2), 3)
is equivalent to the 3×3 identity matrix. fnfill_nd
has two methods: one where D
is passed to filleye_nd
as a regular variable, and one where D
is inferred from the size the input A::StaticArray
.
using BenchmarkTools
using StaticArrays
using LinearAlgebra
function fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(I)
end
return A
end
function fnfill_nd!(A::StaticArray{NTuple{K,D}}, f) where {K,D}
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(I)
end
return A
end
Below, I define two simple functions on the indices of two- and four-dimensional arrays, respectively, to which we can apply fnfill_nd!
.
# The kronecker delta, δᵢⱼ
@inline kron2(i::Int, j::Int) = (i==j)
@inline kron2(I::CartesianIndex{2}) = kron2(Tuple(I)...)
# A product of two kronecker deltas: δᵢₖ δⱼₗ
@inline kron4(i,j,k,l) = (i==k) && (j==l)
@inline kron4(I::CartesianIndex{4}) = kron4(Tuple(I)...)
Issue 1: Apparent runtime cost of D
as a type parameter
Here are benchmarks of fnfill_nd!
applied to kron2
and kron4
.
println("Two-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MMatrix{3,3,Bool}(undef)
println(" Dimension size D specified as regular variable")
result = @btime fnfill_nd!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_nd!($buf, kron2)
@assert result == I[1:3, 1:3]
end;
println()
println("Four-dimensional case, dimension-agnostic iteration over CartesianIndices")
let buf = MArray{NTuple{4,3}, Bool}(undef)
println(" Dimension size D passed as regular variable")
result = @btime fnfill_nd!($buf, kron4, 3)
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_nd!($buf, kron4)
end;
Two-dimensional case, dimension-agnostic iteration over CartesianIndices
Dimension size D specified as regular variable
5.587 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
11.885 ns (0 allocations: 0 bytes)
Four-dimensional case, dimension-agnostic iteration over CartesianIndices
Dimension size D passed as regular variable
107.511 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
100.239 ns (0 allocations: 0 bytes)
The first of the above results (for two dimensions) suggests an cost of using the type parameter D
versus a simple value. I am wondering if someone could help me understand where that cost is coming from.
Note that the latter result for 4 dimensions is inconsistent as to which method is faster, as the relative difference is small.
Issue 2: Performance of CartesianIndices
compared to hard-coded nested loops
Here are functions analogous to fnfill_nd!
, but with hard-coded loops for two and four dimensions.
function fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I)
end
return A
end
function fnfill_2d!(A::StaticArray{NTuple{2,D}}, f) where D
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I)
end
return A
end
function fnfill_4d!(A::AbstractArray, f, D::Int)
for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{4}(i,j,k,l)
A[I] = f(I)
end
return A
end
function fnfill_4d!(A::StaticArray{NTuple{4,D}}, f) where D
for l ∈ 1:D, k ∈ 1:D, j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{4}(i,j,k,l)
A[I] = f(I)
end
return A
end
We can benchmark these and compare to the analogous benchmarks of fnfill_nd
.
println("Two-dimensional case, hard-coded loops")
let buf = MMatrix{3,3,Bool}(undef)
println(" Dimension size D specified as regular variable")
result = @btime fnfill_2d!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_2d!($buf, kron2)
@assert result == I[1:3, 1:3]
end;
println()
println("Four-dimensional case, hard-coded loops")
let buf = MArray{NTuple{4,3}, Bool}(undef)
println("Dimension size D passed as regular variable")
result = @btime fnfill_4d!($buf, kron4, 3)
println(" Dimension size D infered as a static parameter of `buf`")
result = @btime fnfill_4d!($buf, kron4)
@assert fnfill_4d!(buf, kron4) == fnfill_nd!(buf, kron4)
end;
Two-dimensional case, hard-coded loops
Dimension size D specified as regular variable
7.970 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
2.374 ns (0 allocations: 0 bytes)
Four-dimensional case, hard-coded loops
Dimension size D passed as regular variable
68.196 ns (0 allocations: 0 bytes)
Dimension size D infered as a static parameter of `buf`
24.403 ns (0 allocations: 0 bytes)
Except for the first case, the hard-coded loops are significantly more efficient, and, in contrast to the previous case, using the type parameter D
is a boon to performance. My (inexpert) guess here is that the compiler is doing some optimization, like unrolling and/or applying some SIMD vectorization to the inner loop(s) because they have a known number of iterations, and, unfortunately, this optimization cannot be done through CartesianIndices
.
My question is then is there a way to have “elegant” dimension-independent code without sacrificing performance? Right now, it appears that if performance is paramount then one should avoid CartesianIndices
, and either manually write a separate method for each dimension or use metaprogramming to help automate this task (which I have so far tried to avoid). Alternatively, maybe some statically-sized analogy of CartesianIndices
would do.
Issue 3: Using splatting in the wrong place greatly degrades performance
Consider a subtly different version of fnfill_nd!
that invokes f(Tuple(I::CartesianIndices)...)
rather than f(I)
.
function splatting_fnfill_nd!(A::AbstractArray{<:Any,K}, f, D::Int) where K
for I ∈ CartesianIndices(ntuple(_ -> D, Val(K)))
A[I] = f(Tuple(I)...)
end
return A
end
let buf = MMatrix{3,3,Bool}(undef)
result = @btime splatting_fnfill_nd!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
println()
@code_warntype splatting_fnfill_nd!(buf, kron2, 3)
end;
226.983 ns (0 allocations: 0 bytes)
MethodInstance for splatting_fnfill_nd!(::MMatrix{3, 3, Bool, 9}, ::typeof(kron2), ::Int64)
from splatting_fnfill_nd!(A::AbstractArray{<:Any, K}, f, D::Int64) where K in Main at In[20]:1
Static Parameters
K = 2
Arguments
#self#::Core.Const(splatting_fnfill_nd!)
A::MMatrix{3, 3, Bool, 9}
f::Core.Const(kron2)
D::Int64
Locals
@_5::Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}
#17::var"#17#18"{Int64}
I::CartesianIndex{2}
Body::MMatrix{3, 3, Bool, 9}
1 ─ %1 = Main.:(var"#17#18")::Core.Const(var"#17#18")
│ %2 = Core.typeof(D)::Core.Const(Int64)
│ %3 = Core.apply_type(%1, %2)::Core.Const(var"#17#18"{Int64})
│ (#17 = %new(%3, D))
│ %5 = #17::var"#17#18"{Int64}
│ %6 = Main.Val($(Expr(:static_parameter, 1)))::Core.Const(Val{2}())
│ %7 = Main.ntuple(%5, %6)::Tuple{Int64, Int64}
│ %8 = Main.CartesianIndices(%7)::CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}
│ (@_5 = Base.iterate(%8))
│ %10 = (@_5 === nothing)::Bool
│ %11 = Base.not_int(%10)::Bool
└── goto #4 if not %11
2 ┄ %13 = @_5::Tuple{CartesianIndex{2}, CartesianIndex{2}}
│ (I = Core.getfield(%13, 1))
│ %15 = Core.getfield(%13, 2)::CartesianIndex{2}
│ %16 = Main.Tuple(I)::Tuple{Int64, Int64}
│ %17 = Core._apply_iterate(Base.iterate, f, %16)::Bool
│ Base.setindex!(A, %17, I)
│ (@_5 = Base.iterate(%8, %15))
│ %20 = (@_5 === nothing)::Bool
│ %21 = Base.not_int(%20)::Bool
└── goto #4 if not %21
3 ─ goto #2
4 ┄ return A
Ouch! splatting_fnfill_nd!
is approximately 40 times slower than fnfill_nd!
. What happened? @btime
reports no allocations and @code_warntype
suggests that splatting_fnfill_nd!
is type stable, at least as far as it should be. The appearance of a local variable of type Union{Nothing, Tuple{CartesianIndex{2}, CartesianIndex{2}}}
, AFAIK, is a result of calling iterate(::CartesianIndices)
and is not usually a performance concern. Basically, the only difference here is where I chose to do the splatting, which suggests that the method of kron2
or kron4
which takes a CartesianIndex
and then splats acts as a function barrier. However, I did not think that a function barrier would be necessary here as long as the splatted Tuple
is of a fully inferred type.
The problem also occurs when using explicit looping.
function splatting_fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(Tuple(I)...)
end
return A
end
let buf = MMatrix{3,3,Bool}(undef)
result = @btime splatting_fnfill_2d!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
end;
238.840 ns (0 allocations: 0 bytes)
However, performance is recovered if one avoids splatting in the loop.
function nosplatting_fnfill_2d!(A::AbstractArray, f, D::Int)
for j ∈ 1:D, i ∈ 1:D
I = CartesianIndex{2}(i,j)
A[I] = f(I[1], I[2])
end
return A
end
let buf = MMatrix{3,3,Bool}(undef)
result = @btime nosplatting_fnfill_2d!($buf, kron2, 3)
@assert result == I[1:3, 1:3]
end;
6.572 ns (0 allocations: 0 bytes)
It seems like the moral here is to avoid splatting in loops even when using type stable functions, but I do not understand why.
Thanks!
Edit: neatness.