Performance details of StaticArrays

Hello!
In the StaticArrays documentation, we read:

  1. The speed of small SVectors, SMatrixs and SArrays is often > 10 × faster than Base.Array
  2. These results improve significantly when using julia -O3 with immutable static arrays, as the extra optimization results in surprisingly good SIMD code.
  3. A very rough rule of thumb is that you should consider using a normal Array for arrays larger than 100 elements.

Could you please explain in a bit more detail the reasons for the performance gain in 1. and the loss in 3.?
One reason I can see is that statically sized arrays allow better loop unrolling.
Anything else?

Well, I think using StaticArrays avoids calling OpenBLAS. OpenBLAS has a certain call overhead, which is worth paying for larger arrays but not for smaller ones. In addition, SArrays are stack allocated rather than heap allocated. Not sure about mutable MVectors, though.
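
To see the small-size difference concretely, something like this quick comparison works (just an illustrative sketch; it assumes you have StaticArrays and BenchmarkTools installed):

using StaticArrays, BenchmarkTools

A = rand(4, 4); B = rand(4, 4)              # ordinary Base.Array
SA = SMatrix{4,4}(A); SB = SMatrix{4,4}(B)  # immutable static matrices

@btime $A * $B    # library matmul call, heap-allocates the result
@btime $SA * $SB  # unrolled generated code, no heap allocation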


SArrays are structs of tuples. MArrays are tuples wrapped in a mutable struct. StaticArrays.jl then writes @generated functions which emit hand-unrolled code based on the size of the array, so yes, it completely avoids BLAS and in many cases avoids any looping at all, just evaluating straight-line statements. The generated matrix-multiplication code, for example, can be tough to read, but the matrix inverses are hard coded for the small cases, and you can see why that is much faster than a generic algorithm.
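
To give a flavour of what "hard coded for the small case" means, here is my own simplified 2×2 inverse in the same spirit (not the package's actual source):

using StaticArrays

# Hand-unrolled 2x2 inverse: no loops, no pivoting, just the
# closed-form cofactor formula.
function inv2x2(A::SMatrix{2,2})
    d = A[1,1] * A[2,2] - A[1,2] * A[2,1]   # determinant
    return SMatrix{2,2}(A[2,2], -A[2,1], -A[1,2], A[1,1]) / d
end

inv2x2(@SMatrix [1.0 2.0; 3.0 4.0])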

On top of that, they all SIMD really well.

So there are two things going on. SArrays are stack-allocated, so there are no heap allocations when using them. Then they have fast dispatches based on their size. So as long as your arrays are small enough that Julia tuples are fast, these are fast.
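
To make the "struct of tuples plus generated code" point concrete, here is a stripped-down toy version of the idea (my own sketch, not StaticArrays' real definitions):

# A toy static vector: just an immutable wrapper around a tuple.
struct MyStaticVector{N,T}
    data::NTuple{N,T}    # the elements live inline in the struct
end

# The generator runs on the *types*, so it knows N at compile time and can
# emit fully unrolled straight-line code: no loop, no heap allocation.
@generated function add(a::MyStaticVector{N,T}, b::MyStaticVector{N,T}) where {N,T}
    tup = Expr(:tuple)
    for i in 1:N
        push!(tup.args, :(a.data[$i] + b.data[$i]))
    end
    return :(MyStaticVector{N,T}($tup))
end

x = MyStaticVector((1.0, 2.0, 3.0))
y = MyStaticVector((4.0, 5.0, 6.0))
add(x, y)    # each element is computed as its own statement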


Re loss for large SVectors:

julia> using StaticArrays
julia> v=SVector(collect(1:1000)...);
julia> @time +(v,v);
  2.412768 seconds (846.77 k allocations: 43.634 MiB, 6.45% gc time)
julia> @code_native +(v,v);
 [disgusting no-jump code]

Branches are not that expensive, and fully unrolling the loop is very bad, both for compile speeds and runtime speeds (instruction cache is limited).

Do they have to avoid branches for some reason, or is that a conscious tradeoff?

I know you know, but just so that it’s clear to everybody else: that’s mostly measuring compilation time.

StaticArrays falls back to a chunked approach or to BLAS for some operations after a certain size limit, but this is not one of those operations. In general, I wouldn’t rely on StaticArrays making the right decision as to when it is beneficial to fall back to BLAS, also because the cutoff point is kind of subjective (how much compilation time is too much, for example?).
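
For reference, "chunked" in general just means short fixed-size inner loops instead of one giant unrolled body; an illustrative sketch of the idea (not StaticArrays' actual implementation):

function chunked_add!(out, a, b)
    n = length(a)
    chunk = 4
    i = 1
    while i + chunk - 1 <= n
        for j in 0:chunk-1                 # small loop the compiler can unroll/SIMD
            @inbounds out[i+j] = a[i+j] + b[i+j]
        end
        i += chunk
    end
    while i <= n                           # remainder
        @inbounds out[i] = a[i] + b[i]
        i += 1
    end
    return out
end

v = collect(1.0:1000.0)
chunked_add!(similar(v), v, v)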

Conscious decision that works well for small arrays.
