Multidimensional algorithms built on CartesianIndices seem to have significant performance overhead when applied to StaticArrays.MArrays, but not to regular Arrays. Writing several functions with nested loops for different dimensions (or leveraging the macros in Base.Cartesian) recovers the performance, but I am wondering why the more elegant CartesianIndices solution is slower when used with MArray.
To demonstrate the problem, consider an N-dimensional algorithm that takes a function f(i_1, ⋯, i_N) of the N indices and uses it to fill the elements of an input array arr.
function fillfn_cartesianindices!(arr::AbstractArray, f::F) where F # force specialization
    @inbounds for I ∈ CartesianIndices(arr)
        arr[I] = f(I)
    end
    return arr
end
Here is a similar @generated function using the Base.Cartesian macros.
using Base.Cartesian
@generated function fillfn_basecartesian!(arr::AbstractArray{<:Any,N}, f::F) where {N,F}
    # Generate an expression of nested for loops:
    #   for i_N ∈ axes(arr, N), ..., i_1 ∈ axes(arr, 1)
    #       arr[i_1, ..., i_N] = f(i_1, ..., i_N)
    #   end
    quote
        @inbounds @nloops $N i arr begin
            (@nref $N arr i) = (@ncall $N f i)
        end
        return arr
    end
end
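For reference, here is a hand-written four-dimensional variant with explicit nested loops, which is roughly what the @generated function above expands to for N = 4 (a sketch of my own: the name fillfn_manual4d! is made up, and the actual macro expansion uses different internal variable names).

function fillfn_manual4d!(arr::AbstractArray{<:Any,4}, f::F) where F
    # Explicit nested loops; i_1 is innermost so the first (fastest-varying) dimension is traversed contiguously.
    @inbounds for i_4 ∈ axes(arr, 4), i_3 ∈ axes(arr, 3), i_2 ∈ axes(arr, 2), i_1 ∈ axes(arr, 1)
        arr[i_1, i_2, i_3, i_4] = f(i_1, i_2, i_3, i_4)
    end
    return arr
end

This is the kind of dimension-specific function I meant by "writing several functions with nested loops for different dimensions" above.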
Finally, here is a function of indices to feed to either of the above functions (for the four-dimensional case).
indexfunc4d(i, j, k, l) = (i==k) * (j==l) - (i==l) * (j==k)
@inline indexfunc4d(I::CartesianIndex{4}) = indexfunc4d(Tuple(I)...)
Benchmarking on a 3×3×3×3 Array shows that using CartesianIndices has essentially no performance overhead compared to the Base.Cartesian version.
using BenchmarkTools
using StaticArrays
let dims = (3,3,3,3)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...))
    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    @assert out1 == out2
end;
Using `CartesianIndices`
107.183 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
114.619 ns (0 allocations: 0 bytes)
A similar benchmark where arr isa MArray shows that fillfn_basecartesian! is roughly 7x faster than fillfn_cartesianindices!.
let SDims = NTuple{4,3}
    dims = fieldtypes(SDims)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...)) # Element type
    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = MArray{$SDims,$T}(undef))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = MArray{$SDims,$T}(undef))
    )
end;
Using `CartesianIndices`
94.878 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
12.877 ns (0 allocations: 0 bytes)
Note that using StaticArrays.sacollect with CartesianIndices seems to be efficient, at least when collecting into an SArray (the MArray version below allocates and is much slower).
@btime StaticArrays.sacollect(SArray{NTuple{4,3}},
    $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
@btime StaticArrays.sacollect(MArray{NTuple{4,3}},
    $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
9.858 ns (0 allocations: 0 bytes)
87.575 ns (1 allocation: 672 bytes)
However, AFAIK, sacollect does not perform a for loop over CartesianIndices at runtime; rather, it builds an expression at compile time that constructs the array “all at once”, as required by SArray.
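To illustrate what I mean by “all at once” (my own sketch, not the actual sacollect implementation), the values can be gathered into a single Tuple of statically known length and passed to the SArray constructor in one call, with no runtime loop writing into a mutable array:

# Reuses indexfunc4d and the StaticArrays import from above.
inds = CartesianIndices(ntuple(_ -> SOneTo(3), Val(4)))
vals = ntuple(i -> indexfunc4d(inds[i]), Val(81)) # 3^4 = 81 values in column-major (linear) order
S = SArray{NTuple{4,3}}(vals)                     # single constructor call from the flat tuple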