Performance details of StaticArray

Gyslain · June 7, 2018, 2:33pm

Hello!
In the StaticArrays documentation, we read:

1. The speed of small SVectors, SMatrixs and SArrays is often > 10 × faster than Base.Array
2. These results improve significantly when using julia -O3 with immutable static arrays, as the extra optimization results in surprisingly good SIMD code.
3. A very rough rule of thumb is that you should consider using a normal Array for arrays larger than 100 elements.

Could you please explain a bit more about the reasons for the performance gain in 1. and the loss in 3.?
One reason I see is that statically sized arrays allow better loop unrolling.
Anything else?

ufechner7 · June 7, 2018, 9:06pm

Well, I think using StaticArrays avoids calling OpenBLAS. OpenBLAS has a certain call overhead, which is worth paying for larger arrays but not for smaller ones. In addition, SArrays are stack allocated rather than heap allocated. Not sure about mutable MVectors, though.

ChrisRackauckas · June 7, 2018, 9:18pm

SArrays are structs wrapping a tuple. MArrays are mutable structs wrapping a tuple. StaticArrays.jl then uses @generated functions that emit hand-written-style code specialized on the size of the array, so yes, it completely avoids BLAS, and in many cases it avoids looping altogether and just evaluates straight-line statements. For example, matrix multiplication can be tough to read:


JuliaArrays/StaticArrays.jl/blob/master/src/matrix_multiply.jl

import LinearAlgebra: BlasFloat, matprod, mul!


# Manage dispatch of * and mul!
# TODO Adjoint? (Inner product?)

# *(A::StaticMatMulLike, B::AbstractVector) causes an ambiguity with SparseArrays
@inline *(A::StaticMatrix, B::AbstractVector) = _mul(Size(A), A, B)
@inline *(A::StaticMatMulLike, B::StaticVector) = _mul(Size(A), Size(B), A, B)
@inline *(A::StaticMatrix, B::StaticVector) = _mul(Size(A), Size(B), A, B)
@inline *(A::StaticMatMulLike, B::StaticMatMulLike) = _mul(Size(A), Size(B), A, B)
@inline *(A::StaticVector, B::StaticMatMulLike) = *(reshape(A, Size(Size(A)[1], 1)), B)
@inline *(A::StaticVector, B::Transpose{<:Any, <:StaticVector}) = _mul(Size(A), Size(B), A, B)
@inline *(A::StaticVector, B::Adjoint{<:Any, <:StaticVector}) = _mul(Size(A), Size(B), A, B)
@inline *(A::StaticArray{Tuple{N,1},<:Any,2}, B::Adjoint{<:Any,<:StaticVector}) where {N} = vec(A) * B
@inline *(A::StaticArray{Tuple{N,1},<:Any,2}, B::Transpose{<:Any,<:StaticVector}) where {N} = vec(A) * B

"""
    mul_result_structure(a::Type, b::Type)

This file has been truncated.

but the matrix inverses are hard coded for the small case and you can see why this would be much faster than a generic algorithm:


JuliaArrays/StaticArrays.jl/blob/master/src/inv.jl

@inline function inv(A::StaticMatrix)
    T = eltype(A)
    S = arithmetic_closure(T)
    A_S = convert(similar_type(A,S),A)
    _inv(Size(A_S),A_S)
end

@inline _inv(::Size{(0,0)}, A) = similar_type(A,typeof(inv(one(eltype(A)))))()

@inline _inv(::Size{(1,1)}, A) = similar_type(A)(inv(A[1]))

@inline function _inv(::Size{(2,2)}, A)
    newtype = similar_type(A)
    idet = 1/det(A)
    @inbounds return newtype((A[4]*idet, -(A[2]*idet), -(A[3]*idet), A[1]*idet))
end

@inline function _inv(::Size{(3,3)}, A)
    newtype = similar_type(A)

This file has been truncated.

On top of that, these all SIMD really well.

So there are two things going on. SArrays are stack-allocated, so using them involves no heap allocations. They also dispatch to fast methods based on their size. So as long as the array is small enough for Julia tuples to be fast, these are fast.
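
As a rough illustration of the two points above, here is a minimal sketch of a tuple-backed struct with a size-specialized @generated method. TinyVec and add are hypothetical names made up for this example; this is not the StaticArrays.jl source, only the same general idea in miniature.

```julia
# Hypothetical toy type (not part of StaticArrays.jl): an immutable struct wrapping a tuple.
struct TinyVec{N,T}
    data::NTuple{N,T}
end

# A @generated function sees only the argument *types*, so N is known while
# the method body is being built and the element-wise sum can be written out
# term by term instead of as a loop.
@generated function add(a::TinyVec{N,T}, b::TinyVec{N,T}) where {N,T}
    terms = [:(a.data[$i] + b.data[$i]) for i in 1:N]
    return :(TinyVec{N,T}(tuple($(terms...))))
end

a = TinyVec((1.0, 2.0, 3.0))
b = TinyVec((10.0, 20.0, 30.0))
c = add(a, b)

println(c.data)                 # (11.0, 22.0, 33.0)
println(isbitstype(typeof(c)))  # true: plain bits, no pointer to heap memory
```

For N = 3 the generated body is a single expression with three additions and no loop or branch. That is attractive for tiny sizes, and it is also why the generated code grows with N.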

foobar_lv2 · June 7, 2018, 10:10pm

Re loss for large SVectors:

julia> using StaticArrays
julia> v=SVector(collect(1:1000)...);
julia> @time +(v,v);
  2.412768 seconds (846.77 k allocations: 43.634 MiB, 6.45% gc time)
julia> @code_native +(v,v);
 [disgusting no-jump code]

Branches are not that expensive, and fully unrolling the loop is very bad, both for compile speeds and runtime speeds (instruction cache is limited).

cstjean · June 7, 2018, 10:25pm

Do they have to avoid branches for some reason, or is that a conscious tradeoff?

tkoolen · June 7, 2018, 11:04pm

I know you know, but just so that it’s clear to everybody else: that’s mostly measuring compilation time.
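
One way to see the split between compile time and run time is to time the same call twice in a fresh session. The sketch below uses a hypothetical fully unrolled helper on a plain tuple (unrolled_add is an illustrative name, not a StaticArrays.jl function), so it runs without any packages. The timings depend on the machine and Julia version, so none are shown here.

```julia
# Hypothetical helper (not from StaticArrays.jl): fully unrolled tuple addition.
@generated function unrolled_add(a::NTuple{N,T}, b::NTuple{N,T}) where {N,T}
    terms = [:(a[$i] + b[$i]) for i in 1:N]
    return :(tuple($(terms...)))
end

t = ntuple(Float64, 200)   # (1.0, 2.0, ..., 200.0)

# The first call has to generate and compile a method specialized for N = 200.
first_call = @elapsed unrolled_add(t, t)

# The second call reuses the compiled method and measures only the run time.
second_call = @elapsed unrolled_add(t, t)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

With BenchmarkTools.jl installed, @btime would give a more reliable run-time figure than a single @elapsed.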

StaticArrays falls back to a chunked approach or to BLAS for some operations after a certain size limit, but this is not one of those operations. In general, I wouldn’t rely on StaticArrays making the right decision as to when it is beneficial to fall back to BLAS, also because the cutoff point is kind of subjective (how much compilation time is too much, for example?).

Conscious decision that works well for small arrays.
