What's the deal with StaticArrays?

First, disclaimer that I have a shallow understanding of heap and stack and allocations. Given that, here is my basic understanding of StaticArrays: They are appropriate when there are lots of small arrays involved because StaticArrays use Tuples underneath, which live in the stack rather than heap. This is more efficient because there are no “allocations” when writing to the stack and reading from the stack is faster (or maybe I am wrong about the allocations bit there).

I am not sure how the compiler knowing the size of the arrays comes into play. If storing data in the heap, I understand that the program can set aside a fixed amount of memory (pre-allocate). Does the notion of pre-allocation apply to the stack as well?

Probably most importantly, how does the information about size get passed from the type signature to the compiler? I understand that the type system will create different types for each Size in StaticArrays{Size}, but how does the compiler know to use Size to decide how much memory to pre-allocate?

I would imagine that having static arrays would be useful even with big arrays for the following reasons:

  • it means the compiler doesn’t have to brace for the arrays changing size when there is no need
  • it means, in certain situations, the compiler can be confident about indices being within bound just by looking at the type. So bound errors can be caught at compile time (I believe this is how Rust works?). I believe bound checking only occurs at runtime?
  • beyond performance, there are cases where it is nice to have a gaurentee that the size of the container is not mutable. And we might not need to do linear algebra with it, just basic arithmatic. So even something basic like what Rust has would be good to have.

So why do we not have StaticArrays that are appropriate for arrays of all sizes?

2 Likes

It is not necessarily stored in the stack. The compiler may put a tuple (hence a static array) entirely in registers, so that it never goes into RAM at all.

There are (at least) three reasons why static arrays are fast for small fixed-size arrays:

  1. Completely unrolling to loop-free code. e.g. v3 = v1 + v2 is just implemented as 3 scalar additions, with no loop overhead at all, if these are all SVector{3, Float64}. The unrolled loops can also sometimes trigger SIMD optimizations.
  2. Avoiding heap (or even stack) allocations of temporary arrays. This is related to point (1) — working with static arrays as local variables is as if you had just written out all of the scalar operations by hand on all the components.
  3. Static arrays can be stored “inline” in other data structures. e.g. a length-N array Vector{SVector{3, Float64}}, such as [v1, v2, v3] from my example above, is stored as 3N consecutive Float64 values. This is much more efficient to access than a Vector{Vector{Float64}}, which is essentially an array of pointers to Vector{Float64} arrays, and is often faster and more convenient than a 3 \times N Matrix{Float64} (e.g. because the length 3 is stored in the type you get the benefits of (1) and (2) when accessing vectors in the array, and no “slicing” or “views” are needed). (There is also HybridArrays.jl to get the best of both worlds for matrices with a static number of rows.)

The compiler has special support for tuples (whose length is in their type signature), so anything else built on top of tuples inherits this support. (One of the major steps in this special support was julia#4042: “Tuples made me sad (so I fixed them)”.)

More generally, any integer value stored in the type gets treated as a compile-time constant for type-inferred code, which may trigger all sorts of LLVM optimizations. But the storage of tuples required specialized compiler work to implement.

It would be a totally different data structure, because the performance tradeoffs are different for long arrays. You (1) don’t want to unroll vector operations of long lengths because the code size explodes, (2) you can’t store them in registers and may even exceed the stack-size limit, and (3) you can already get many of the benefits of “inline” storage in arrays by storing them as the columns of a Matrix, whereas you wouldn’t want to store a large array inline in an arbitrary struct.

This is generally not a significant concern for large arrays.

In operations on large arrays where bounds checks are expensive, you can already eliminate this overhead by checking the bounds once outside of your loop(s) and then using @inbounds within your inner loop bodies. This is not (currently) automated by the compiler AFAIK, though.

This could be implemented quite easily by wrapping a Vector in another data structure, ala ReadOnlyArrays.jl. (I don’t recall a huge demand for this as a stand-alone Array-like data structure in which only the length is immutable, though?)

Actually, an easy way to fix the length of an array right now is just to use an N \times 1 Matrix, since a multidimensional Array in Julia cannot be resized — you can still access such a matrix efficiently as A[i], as if it were a 1d array with indices i ∈ 1:length(A) (what Julia calls “linear indexing”).

16 Likes

LLVM has an InductiveRangeCheckElimination pass that Julia uses.

So if you use for i = eachindex(x), it’ll be able to remove the bounds checks for x[i]. It won’t work for y[i], so you’d still have to be careful to get all array axes of all arrays you’re indexing.

E.g., use eachindex(x,y) instead, or maybe a eachindex(x) == eachindex(y) || throw(DimensionMismatch("x and y indices don't equal")) in front of the loop, or something of that sort.

It would be nice if the compiler could replace bounds-checked loops with loops that iterate over the intersection of any indexed axes and the original bounds, and then throw the appropriate bounds error afterwords if this intersection was a subset of the original loop’s bounds.

7 Likes

If I do this, will Julia be able to remove bounds checks on y[i] despite the loop being on eachindex(x) only? Does that solve my most common eachindex gripe below then?

Thanks @stevengj for a great response.

My take away is that there isn’t much performance benefit of having arrays be of static size when they are large.

I think there might still be safety benifits. I understand that you can implement something like ReadOnlyArrays.jl by validating the size inside constructors, but the fields of the objects, which would be plain old arrays, can be messed with after the objects are created. Maybe that’s a feature in the realm of Julia. I personally think it would be nice to have something between a Tuple and an Array. Where a Tuple is not mutable at all, a static array would have immutable size but mutable contents. EDIT: Jsut saw @stevengj 's response about using multi-dimensional arrays as fixed-sized arrays. I’d been thinking whether I should switch from a vector to a Nx1 matrix for this reason in something I am working on. Good to have it validated.

Also, just to confirm, bound checking gets skipped for eachindex(x) right? And eachindex(x, y) too? Or do we need @inbounds?

julia> function mydot(x,y)
           eachindex(x) == eachindex(y) || throw(DimensionMismatch("x and y must be same size!"))
           s = zero(Base.promote_eltype(x,y))
           @fastmath for i = eachindex(x)
               s += x[i]*y[i]
           end
           s
       end
mydot (generic function with 1 method)

julia> x = rand(128); y = rand(128);

julia> @btime mydot($x,$y)
  9.702 ns (0 allocations: 0 bytes)
30.668576075140155

Assembly and LLVM IR show the expected SIMD, e.g.

  %33 = fmul fast <8 x double> %wide.load59, %wide.load
  %34 = fmul fast <8 x double> %wide.load60, %wide.load56
  %35 = fmul fast <8 x double> %wide.load61, %wide.load57
  %36 = fmul fast <8 x double> %wide.load62, %wide.load58
  %37 = fadd fast <8 x double> %33, %vec.phi
  %38 = fadd fast <8 x double> %34, %vec.phi53
  %39 = fadd fast <8 x double> %35, %vec.phi54
  %40 = fadd fast <8 x double> %36, %vec.phi55

But I would suggest double checking that it is working correctly, at least for any code you expect to be hot enough to matter.

EDIT:
In this case, LLVM actually also managed to SIMD it even without the check. It created a fast path. So, it can do quite well in simple cases.

2 Likes