MArray performance

For small mutable arrays, my (perhaps incorrect) understanding was that MArrays from StaticArrays.jl would provide better performance than regular arrays. However, in the example I am working on (where I use an MArray for the input and a pre-allocated MArray for the output), I am finding much worse performance. Consider the code below.

using StaticArrays
using BenchmarkTools

@inline function LogisticCdf!(y::AbstractArray{T}, x::AbstractArray{T}, μ::T, σ::T) where{T <: Real}
    @inbounds for iₓ in eachindex(x)
        z = exp(-abs(x[iₓ] - μ) / σ)
        y[iₓ] = (x[iₓ] ≥ μ ? one(T) : z) / (one(T) + z)
    end
end

function test()
    x = collect(LinRange(-4.0, 4.0, 50))
    xm = MVector{50, Float64}(x)
    μ = 2.0; σ = 2.2
    y = similar(x)
    ym = similar(xm)
    @btime LogisticCdf!($y, $x, $μ, $σ)
    @btime LogisticCdf!($ym, $xm, $μ, $σ)
end

test()

I obtain

  143.026 ns (0 allocations: 0 bytes)
  292.181 ns (0 allocations: 0 bytes)

so I’m finding it twice as slow. This makes me think that I am either using MArrays incorrectly, or inappropriately. Any guidance would be appreciated. Thanks.


I don’t have a lot of time to dig into this right now, so my response will be limited.

I can’t explain why MArray is considerably slower in this case.

However, I wouldn’t generally expect MArray to be much/any faster than a normal Array except when functions can take specific advantage of the size being encoded in the type (for loop unrolling, primarily). For example, linear-algebraic operations using MArrays will write custom functions for the exact size and usually beat BLAS (used by Array) by a wide margin at small sizes. But your function wouldn’t allow this information to be used in a helpful way.
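To illustrate that point (a minimal sketch, not from the original posts): a 4×4 SMatrix multiply gets a fully unrolled method specialized on the size in the type, while a plain Matrix of the same size goes through the generic BLAS path and allocates its result:

```julia
using StaticArrays, BenchmarkTools

# Static matrices: size is part of the type, so * is unrolled at compile time
A = @SMatrix rand(4, 4)
B = @SMatrix rand(4, 4)

# Regular matrices: generic BLAS multiply with a heap-allocated result
a = rand(4, 4)
b = rand(4, 4)

@btime $A * $B  # specialized, stack-allocated result
@btime $a * $b  # generic path, allocates a new Matrix
```

At this size the static version typically wins by a wide margin, which is exactly the kind of advantage an elementwise loop like `LogisticCdf!` cannot exploit.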

I would expect a performance benefit from using an SArray here instead, especially if you count the similar(_) preallocation needed for the other two versions. Making the array static usually puts it on the stack, which can carry further benefits even over an MArray. But you’d need to write a different LogisticCdf function to support that, probably based on calling map over an SVector version of x. Something like

function LogisticCdf(x::AbstractArray{T}, μ::T, σ::T) where{T <: Real}
    return map(x) do xi
        z = exp(-abs(xi - μ) / σ)
        return (xi ≥ μ ? one(T) : z) / (one(T) + z)
    end
end

If you are ever in a situation where you can use SArray instead of MArray without excessive gymnastics, I’d recommend it.
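As a quick usage sketch (repeating the map-based definition above so the snippet runs standalone): calling it on an SVector returns a new SVector, so no heap allocation is needed for the output at all.

```julia
using StaticArrays

# map-based version from above: map over an SVector yields an SVector
function LogisticCdf(x::AbstractArray{T}, μ::T, σ::T) where {T <: Real}
    return map(x) do xi
        z = exp(-abs(xi - μ) / σ)
        return (xi ≥ μ ? one(T) : z) / (one(T) + z)
    end
end

xs = SVector{50, Float64}(LinRange(-4.0, 4.0, 50))
ys = LogisticCdf(xs, 2.0, 2.2)  # ys isa SVector{50, Float64}
```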

EDIT: I realized later that I should have just written the scalar version of LogisticCdf and then broadcast it over the desired array. So more like

function LogisticCdf(x::T, μ::T, σ::T) where{T <: Real}
    z = exp(-abs(x - μ) / σ)
    return (x ≥ μ ? one(T) : z) / (one(T) + z)
end

@btime LogisticCdf.($x, $μ, $σ)

setindex!() for MVector goes through an unsafe_store! call, which may or may not be friendly to the compiler’s loop optimizations?

I get

julia> using LoopVectorization, StaticArrays

julia> @inline function LogisticCdf_lv!(y::AbstractArray{T}, x::AbstractArray{T}, μ::T, σ::T) where{T <: Real}
           @turbo for iₓ in eachindex(x)
               z = exp(-abs(x[iₓ] - μ) / σ)
               y[iₓ] = (x[iₓ] ≥ μ ? one(T) : z) / (one(T) + z)
           end
       end
LogisticCdf_lv! (generic function with 1 method)

julia> @inline function LogisticCdf!(y::AbstractArray{T}, x::AbstractArray{T}, μ::T, σ::T) where{T <: Real}
           @inbounds @simd ivdep for iₓ in eachindex(x)
               z = @inline exp(-abs(x[iₓ] - μ) / σ)
               y[iₓ] = (x[iₓ] ≥ μ ? one(T) : z) / (one(T) + z)
           end
       end
LogisticCdf! (generic function with 1 method)

julia> function test()
           x = collect(LinRange(-4.0, 4.0, 50))
           xm = MVector{50, Float64}(x)
           μ = 2.0; σ = 2.2
           y = similar(x)
           ym = similar(xm)
           @btime LogisticCdf!($y, $x, $μ, $σ)
           @btime LogisticCdf!($ym, $xm, $μ, $σ)
           @btime LogisticCdf_lv!($y, $x, $μ, $σ)
           @btime LogisticCdf_lv!($ym, $xm, $μ, $σ)
       end
test (generic function with 1 method)

julia> test()
  124.194 ns (0 allocations: 0 bytes)
  120.487 ns (0 allocations: 0 bytes)
  75.033 ns (0 allocations: 0 bytes)
  63.810 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.9.0-DEV.1331
Commit 6f8e24c672* (2022-09-12 16:52 UTC)

Performance is really bad for the non-@turbo version on Julia 1.8.1, though, because table_unpack isn’t inlining into exp.

I think static arrays are faster only when you use them to avoid allocations that you would get with regular arrays.
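A minimal sketch of that point: the same broadcast allocates a fresh Vector for a regular Array input, but stays on the stack for an SVector input.

```julia
using StaticArrays, BenchmarkTools

x = rand(50)
xs = SVector{50, Float64}(x)

@btime abs.($x)   # allocates a new Vector on every call
@btime abs.($xs)  # 0 allocations: the result is a stack-allocated SVector
```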