Multidimensional algorithms built on CartesianIndices seem to have significant performance overhead when applied to StaticArrays.MArrays, but not to regular Arrays. Writing a separate function with explicitly nested loops for each dimension (or generating them with the macros in Base.Cartesian) recovers the performance, but I am wondering why the more elegant CartesianIndices solution is slower when used with an MArray.
To demonstrate the problem, consider an N-dimensional algorithm that takes a function f(i_1, ⋯, i_N) of the N indices and uses it to fill the elements of an input array arr.
function fillfn_cartesianindices!(arr::AbstractArray, f::F) where F # force specialization
    @inbounds for I ∈ CartesianIndices(arr)
        arr[I] = f(I)
    end
    return arr
end
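As a quick sanity check (a hypothetical example, not part of the benchmarks below), it can be used like this:
# Hypothetical 2×2 example: fill each element with the sum of its indices.
A = zeros(Int, 2, 2)
fillfn_cartesianindices!(A, I -> sum(Tuple(I)))
# A == [2 3; 3 4]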
Here is a similar @generated function using the Base.Cartesian macros.
using Base.Cartesian
@generated function fillfn_basecartesian!(arr::AbstractArray{<:Any,N}, f::F) where {N,F}
    # Generate an expression of nested for loops:
    #     for i_N ∈ axes(arr, N), ..., i_1 ∈ axes(arr, 1)
    #         arr[i_1, ..., i_N] = f(i_1, ..., i_N)
    #     end
    quote
        @inbounds @nloops $N i arr begin
            (@nref $N arr i) = (@ncall $N f i)
        end
        return arr
    end
end
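For reference, the code generated for N = 2 is roughly equivalent to the hand-written version below (fillfn_2d! is just a hypothetical name for illustration; the actual @nloops expansion uses explicitly nested for blocks rather than the comma syntax):
# Roughly what fillfn_basecartesian! compiles to for a 2-dimensional array.
function fillfn_2d!(arr::AbstractMatrix, f::F) where F
    @inbounds for i_2 ∈ axes(arr, 2), i_1 ∈ axes(arr, 1)
        arr[i_1, i_2] = f(i_1, i_2)
    end
    return arr
end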
Finally, here is a function of the indices to feed to either of the above functions (for the four-dimensional case).
indexfunc4d(i, j, k, l) = (i==k) * (j==l) - (i==l) * (j==k)
@inline indexfunc4d(I::CartesianIndex{4}) = indexfunc4d(Tuple(I)...)
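A couple of sanity checks, whose values follow directly from the definition above:
# Values follow directly from the definition of indexfunc4d:
@assert indexfunc4d(1, 2, 1, 2) == 1   # (1==1)*(2==2) - (1==2)*(2==1)
@assert indexfunc4d(1, 2, 2, 1) == -1  # (1==2)*(2==1) - (1==1)*(2==2)
@assert indexfunc4d(1, 1, 2, 2) == 0   # i == j, so both terms cancel
@assert indexfunc4d(CartesianIndex(1, 2, 1, 2)) == 1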
Benchmarking on a 3×3×3×3 Array shows that using CartesianIndices has no performance overhead.
using BenchmarkTools
using StaticArrays
let dims = (3,3,3,3)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...))
    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = Array{$T}(undef, $dims))
    )
    @assert out1 == out2
end;
Using `CartesianIndices`
107.183 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
114.619 ns (0 allocations: 0 bytes)
A similar benchmark where arr isa MArray shows that fillfn_basecartesian! is roughly 7x faster than fillfn_cartesianindices!.
let SDims = NTuple{4,3}
    dims = fieldtypes(SDims)
    T = eltype(indexfunc4d(map(_ -> 1, dims)...)) # Element type
    println("Using `CartesianIndices`")
    out1 = @btime(
        fillfn_cartesianindices!(A, $indexfunc4d),
        setup=(A = MArray{$SDims,$T}(undef))
    )
    println("Using `Base.Cartesian`")
    out2 = @btime(
        fillfn_basecartesian!(A, $indexfunc4d),
        setup=(A = MArray{$SDims,$T}(undef))
    )
end;
Using `CartesianIndices`
94.878 ns (0 allocations: 0 bytes)
Using `Base.Cartesian`
12.877 ns (0 allocations: 0 bytes)
Note that using StaticArrays.sacollect with a generator over CartesianIndices is efficient for SArray, though not for MArray.
@btime StaticArrays.sacollect(SArray{NTuple{4,3}},
    $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
@btime StaticArrays.sacollect(MArray{NTuple{4,3}},
    $indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))));
9.858 ns (0 allocations: 0 bytes)
87.575 ns (1 allocation: 672 bytes)
However, AFAIK, sacollect does not perform a for loop over CartesianIndices at runtime; rather, it constructs an expression at compile time that builds the array "all at once", as required by SArray.
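For illustration, constructing an SArray requires all of its elements up front, roughly as in the sketch below. This is not how sacollect is actually implemented (and the explicit Tuple(...) collection is not type-stable); it is only meant to show the "all-at-once" construction.
# Conceptual "all-at-once" construction (not the actual sacollect implementation):
# gather every element first, then build the SArray from the tuple in one call.
vals = Tuple(indexfunc4d(I) for I ∈ CartesianIndices(ntuple(_ -> SOneTo(3), Val(4))))
S = SArray{NTuple{4,3}}(vals)  # elements are consumed in column-major order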