Performance of using CartesianIndices with MArray

Multidimensional algorithms built on CartesianIndices seem to have significant performance overhead when applied to StaticArrays.MArray​s, but not regular Array​s. Writing several functions with nested loops for different dimensions (or leveraging the macros in Base.Cartesian) recovers performance, but I am wondering why the more elegant CartesianIndices solution is slower when used with MArray.

To demonstrate the problem, consider an N-dimensional algorithm that takes a function f(i_1, ⋯, i_N) of the N indices and uses it to fill the elements of an input array arr.

function fillfn_cartesianindices!(arr::AbstractArray, f::F) where F  # force specialization
    @inbounds for I ∈ CartesianIndices(arr)
        arr[I] = f(I)
    end
    return arr
end

Here is a similar @generated function using the Base.Cartesian macros.

using Base.Cartesian

@generated function fillfn_basecartesian!(arr::AbstractArray{<:Any,N}, f::F) where {N,F}
    # Generate an expression of nested for loops:
    #   for i_N ∈ axes(arr, N), ..., i_1 ∈ axes(arr, 1)
    #       arr[i_1, ..., i_N] = f(i_1, ..., i_N)
    #   end
    quote
        @inbounds @nloops $N i arr begin
            (@nref $N arr i) = (@ncall $N f i)
        end
        return arr
    end
end

Finally, here is function of indices to feed to either of the above functions (for the four dimensional case).

    indexfunc4d(i, j, k, l) = (i==k) * (j==l) - (i==l) * (j==k)
    @inline indexfunc4d(I::CartesianIndex{4}) = indexfunc4d(Tuple(I)...)

Benchmarking on a 3×3×3×3 Array shows that using CartesianIndices has no performance overhead.

using BenchmarkTools
using StaticArrays

let dims = (3,3,3,3)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...))

    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    @assert out1 == out2
end;
Using `CartesianIndices`
  107.183 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
  114.619 ns (0 allocations: 0 bytes)

A similar benchmark where arr isa MArray shows that fillfn_basecartesian! 7x faster than fillfn_cartesianindices!.

    let SDims = NTuple{4,3}
        dims = fieldtypes(SDims)
        T = eltype(indexfunc4d(map(_ -> 1, dims)...))  # Element type
    
        println("Using `CartesianIndices`")
        out1 = @btime(
            fillfn_cartesianindices!(A, $indexfunc4d),
            setup=(A = MArray{$SDims,$T}(undef))
        )
        println("Using `Base.Cartesian`")
        out2 = @btime(
            fillfn_basecartesian!(A, $indexfunc4d),
            setup=(A = MArray{$SDims,$T}(undef))
        )
    end;
Using `CartesianIndices`
  94.878 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
  12.877 ns (0 allocations: 0 bytes)

Note that using StaticArrays.sacollect with CartesianIndices seems to be efficient.

@btime StaticArrays.sacollect(SArray{NTuple{4,3}},
                              $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
@btime StaticArrays.sacollect(MArray{NTuple{4,3}},
                              $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
9.858 ns (0 allocations: 0 bytes)
87.575 ns (1 allocation: 672 bytes)

However, AFAIK, sacollect does not perform a for loop over CartesianIndices at runtime but rather constructs an expression during compilation to construct the array “all-at-once”, as required by SArray.

1 Like

It looks to me like MArray doesn’t have special support for CartesianIndices and I don’t know if this is even possible. Using @code_llvm I saw that

out2 = @btime(
            fillfn_basecartesian!(A, $indexfunc4d),
            setup=(A = MArray{$SDims,$T}(undef))
        )

compiles down to a series of call void @llvm.memset.p0i8.i64.

Thanks for your reply and to looking into this. Perhaps this is one of those cases why Base.Cartesian is kept around. I opened an issue with StaticArrays in case others run into the same problem or know a way to fix the issue.

Performance of using CartesianIndices with MArray · Issue #1010 · JuliaArrays/StaticArrays.jl · GitHub.