Multidimensional algorithms built on `CartesianIndices`

seem to have significant performance overhead when applied to `StaticArrays.MArray`

s, but not regular `Array`

s. Writing several functions with nested loops for different dimensions (or leveraging the macros in `Base.Cartesian`

) recovers performance, but I am wondering why the more elegant `CartesianIndices`

solution is slower when used with `MArray`

.

To demonstrate the problem, consider an `N`

-dimensional algorithm that takes a function `f(i_1, ⋯, i_N)`

of the `N`

indices and uses it to fill the elements of an input array `arr`

.

```
function fillfn_cartesianindices!(arr::AbstractArray, f::F) where F # force specialization
@inbounds for I ∈ CartesianIndices(arr)
arr[I] = f(I)
end
return arr
end
```

Here is a similar `@generated`

function using the `Base.Cartesian`

macros.

```
using Base.Cartesian
@generated function fillfn_basecartesian!(arr::AbstractArray{<:Any,N}, f::F) where {N,F}
# Generate an expression of nested for loops:
# for i_N ∈ axes(arr, N), ..., i_1 ∈ axes(arr, 1)
# arr[i_1, ..., i_N] = f(i_1, ..., i_N)
# end
quote
@inbounds @nloops $N i arr begin
(@nref $N arr i) = (@ncall $N f i)
end
return arr
end
end
```

Finally, here is function of indices to feed to either of the above functions (for the four dimensional case).

```
indexfunc4d(i, j, k, l) = (i==k) * (j==l) - (i==l) * (j==k)
@inline indexfunc4d(I::CartesianIndex{4}) = indexfunc4d(Tuple(I)...)
```

Benchmarking on a 3×3×3×3 `Array`

shows that using `CartesianIndices`

has no performance overhead.

```
using BenchmarkTools
using StaticArrays
let dims = (3,3,3,3)
T = eltype(indexfunc4d(map(_ -> 1, dims)...))
println("Using `CartesianIndices`")
out1 = @btime(
fillfn_cartesianindices!(A, $indexfunc4d),
setup=(A = Array{$T}(undef, $dims))
)
println("Using `Base.Cartesian`")
out2 = @btime(
fillfn_basecartesian!(A, $indexfunc4d),
setup=(A = Array{$T}(undef, $dims))
)
@assert out1 == out2
end;
```

```
Using `CartesianIndices`
107.183 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
114.619 ns (0 allocations: 0 bytes)
```

A similar benchmark where `arr isa MArray`

shows that `fillfn_basecartesian!`

7x faster than `fillfn_cartesianindices!`

.

```
let SDims = NTuple{4,3}
dims = fieldtypes(SDims)
T = eltype(indexfunc4d(map(_ -> 1, dims)...)) # Element type
println("Using `CartesianIndices`")
out1 = @btime(
fillfn_cartesianindices!(A, $indexfunc4d),
setup=(A = MArray{$SDims,$T}(undef))
)
println("Using `Base.Cartesian`")
out2 = @btime(
fillfn_basecartesian!(A, $indexfunc4d),
setup=(A = MArray{$SDims,$T}(undef))
)
end;
```

```
Using `CartesianIndices`
94.878 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
12.877 ns (0 allocations: 0 bytes)
```

Note that using `StaticArrays.sacollect`

with `CartesianIndices`

seems to be efficient.

```
@btime StaticArrays.sacollect(SArray{NTuple{4,3}},
$indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
@btime StaticArrays.sacollect(MArray{NTuple{4,3}},
$indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
```

```
9.858 ns (0 allocations: 0 bytes)
87.575 ns (1 allocation: 672 bytes)
```

However, AFAIK, `sacollect`

does not perform a `for`

loop over `CartesianIndices`

at runtime but rather constructs an expression during compilation to construct the array “all-at-once”, as required by `SArray`

.