Multidimensional algorithms built on CartesianIndices seem to have significant performance overhead when applied to StaticArrays.MArrays, but not to regular Arrays. Writing a separate function with explicitly nested loops for each dimension (or generating them with the macros in Base.Cartesian) recovers the performance, but I am wondering why the more elegant CartesianIndices solution is slower when used with an MArray.
To demonstrate the problem, consider an N-dimensional algorithm that takes a function f(i_1, ⋯, i_N) of the N indices and uses it to fill the elements of an input array arr.
function fillfn_cartesianindices!(arr::AbstractArray, f::F) where F # force specialization
    @inbounds for I ∈ CartesianIndices(arr)
        arr[I] = f(I)
    end
    return arr
end
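As a quick sanity check (a hypothetical example, not part of the benchmarks below), it can be used like this:
# Hypothetical 2×2 example: fill each element with the sum of its indices.
A = zeros(Int, 2, 2)
fillfn_cartesianindices!(A, I -> sum(Tuple(I)))
# A == [2 3; 3 4]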
Here is a similar @generated function using the Base.Cartesian macros.
using Base.Cartesian
@generated function fillfn_basecartesian!(arr::AbstractArray{<:Any,N}, f::F) where {N,F}
    # Generate an expression of nested for loops:
    #     for i_N ∈ axes(arr, N), ..., i_1 ∈ axes(arr, 1)
    #         arr[i_1, ..., i_N] = f(i_1, ..., i_N)
    #     end
    quote
        @inbounds @nloops $N i arr begin
            (@nref $N arr i) = (@ncall $N f i)
        end
        return arr
    end
end
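For reference, the code generated for N = 2 is roughly equivalent to the hand-written version below (fillfn_2d! is just a hypothetical name for illustration; the actual @nloops expansion uses explicitly nested for blocks rather than the comma syntax):
# Roughly what fillfn_basecartesian! compiles to for a 2-dimensional array.
function fillfn_2d!(arr::AbstractMatrix, f::F) where F
    @inbounds for i_2 ∈ axes(arr, 2), i_1 ∈ axes(arr, 1)
        arr[i_1, i_2] = f(i_1, i_2)
    end
    return arr
end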
Finally, here is a function of the indices to feed to either of the above functions (for the four-dimensional case).
indexfunc4d(i, j, k, l) = (i==k) * (j==l) - (i==l) * (j==k)
@inline indexfunc4d(I::CartesianIndex{4}) = indexfunc4d(Tuple(I)...)
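A couple of sanity checks, whose values follow directly from the definition above:
# Values follow directly from the definition of indexfunc4d:
@assert indexfunc4d(1, 2, 1, 2) == 1   # (1==1)*(2==2) - (1==2)*(2==1)
@assert indexfunc4d(1, 2, 2, 1) == -1  # (1==2)*(2==1) - (1==1)*(2==2)
@assert indexfunc4d(1, 1, 2, 2) == 0   # i == j, so both terms cancel
@assert indexfunc4d(CartesianIndex(1, 2, 1, 2)) == 1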
Benchmarking on a 3×3×3×3 Array shows that using CartesianIndices has no performance overhead.
using BenchmarkTools
using StaticArrays
let dims = (3,3,3,3)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...))
    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    @assert out1 == out2
end;
Using `CartesianIndices`
107.183 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
114.619 ns (0 allocations: 0 bytes)
A similar benchmark where arr isa MArray shows that fillfn_basecartesian! is roughly 7x faster than fillfn_cartesianindices!.
let SDims = NTuple{4,3}
    dims = fieldtypes(SDims)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...)) # Element type
    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = MArray{$SDims,$T}(undef))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = MArray{$SDims,$T}(undef))
    )
end;
Using `CartesianIndices`
94.878 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
12.877 ns (0 allocations: 0 bytes)
Note that using StaticArrays.sacollect with a generator over CartesianIndices is efficient for SArray, though not for MArray.
@btime StaticArrays.sacollect(SArray{NTuple{4,3}},
    $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
@btime StaticArrays.sacollect(MArray{NTuple{4,3}},
    $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
9.858 ns (0 allocations: 0 bytes)
87.575 ns (1 allocation: 672 bytes)
However, AFAIK, sacollect does not perform a for loop over CartesianIndices at runtime; rather, it constructs an expression at compile time that builds the array "all at once", as required by SArray.
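For illustration, constructing an SArray requires all of its elements up front, roughly as in the sketch below. This is not how sacollect is actually implemented (and the explicit Tuple(...) collection is not type-stable); it is only meant to show the "all-at-once" construction.
# Conceptual "all-at-once" construction (not the actual sacollect implementation):
# gather every element first, then build the SArray from the tuple in one call.
vals = Tuple(indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))))
S = SArray{NTuple{4,3}}(vals)  # elements are consumed in column-major order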