[ANN] FixedSizeArrays.jl: What Array probably should have been

Relevant WIP PR, hoping something like this gets merged one day:

1 Like

What if the multiplication is out of place? Will FixedSizeArrays.jl hit worse paths?

Technically, there are some cases where FixedSizeArray could be faster even for in-place BLAS operations, because when you do e.g.

mul!(C, A, B)

it has to follow one extra level of pointer indirection to get the data pointers for C, A, and B if they are Arrays, in order to pass those pointers to the BLAS library.

That is, with Array, it has to look inside the mutable struct Array for the pointer to the MemoryRef, and then follow the MemoryRef’s pointer to the actual data. With FixedSizeArray, the reference to the Memory is stored inline, so it only needs to follow one pointer to get to the data.
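
To make that concrete, here’s a small sketch of how one would look at the data pointers involved (illustrative only; pointer support on FixedSizeArray is assumed here, which BLAS dispatch requires anyway):

using FixedSizeArrays

A = rand(Float32, 4, 4)
F = FixedSizeArray(A)

GC.@preserve A F begin
    pA = pointer(A)  # Array box -> MemoryRef -> data: two dependent loads
    pF = pointer(F)  # inline Memory reference -> data: one load
end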

In reality though,

  1. This is an O(1) cost at worst, so it only matters for very small matrices that you multiply many times while the data isn’t hot in cache.
  2. This extra level of pointer indirection is something Julia is often quite good at eliding, so it’s sometimes an O(0) cost.
  3. The current implementation of FixedSizeArrays.jl is actually slower at things like mul! than regular Array, probably because of some missing dispatches somewhere in the chain between mul! and the eventual BLAS ccall. So the advantage is only theoretical; in practice it’s currently a disadvantage.
5 Likes

Yes, it’s totally possible that in some cases FixedSizeArrays.jl is hitting a slow generic fallback method for AbstractArray, when pretty much the same method as for Array would work and would be equally efficient.

One case where this happened was rand!: we were hitting the slow generic fallback method, and just copying the methods used for Array closed the performance gap.

At the moment we don’t have any specialised methods for LinearAlgebra functions; if users find methods that could be improved, contributions are very much welcome!
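
One quick way to hunt for such fallbacks is to check which method a call actually lands on; a sketch:

using Random, FixedSizeArrays

A = FixedSizeArray(rand(100))
@which rand!(A)  # landing on a generic AbstractArray method is the slow path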

7 Likes

LinearAlgebra.jl uses the StridedArray union for most of its dispatch to BLAS. As a subtype of DenseArray, FixedSizeArray is included in that union, so it shouldn’t need a lot of dedicated methods. Any MWEs for slow BLAS/LAPACK operations?
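
For what it’s worth, the subtyping is easy to verify (a quick sketch):

using FixedSizeArrays

A = FixedSizeArray(rand(Float32, 4, 4))
typeof(A) <: StridedMatrix{Float32}  # true: FixedSizeArray <: DenseArray <: StridedArray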

4 Likes

:+1: Here’s my MWE, not sure if it’s the same that @Mason found:

julia> using BenchmarkTools, LinearAlgebra, FixedSizeArrays

julia> @btime mul!(x, y, z) setup=(n = 4; x = FixedSizeArray(rand(Float32, n, n)); y = FixedSizeArray(rand(Float32, n, n)); z = FixedSizeArray(rand(Float32, n, n));) seconds=Inf samples=100000 evals=10;
  147.200 ns (0 allocations: 0 bytes)

julia> @btime mul!(x, y, z) setup=(n = 4; x = rand(Float32, n, n); y = rand(Float32, n, n); z = rand(Float32, n, n);) seconds=Inf samples=100000 evals=10;
  134.300 ns (0 allocations: 0 bytes)

Profiling this example, as far as I can see the same code paths are taken, yet Matrix is faster than FixedSizeMatrix. With both C=false and C=true (excluding and including C frames) I see the same functions and line numbers in the flame graph, so I’m not sure where the performance difference comes from. Perhaps taking a look at the profiles in PProf could reveal something; trying that next.
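
In case anyone wants to reproduce it, here’s a sketch of that profiling workflow (assumes PProf.jl is installed; sizes mirror the MWE above):

using Profile, PProf, LinearAlgebra, FixedSizeArrays

n = 4
x = FixedSizeArray(rand(Float32, n, n))
y = FixedSizeArray(rand(Float32, n, n))
z = FixedSizeArray(rand(Float32, n, n))

Profile.clear()
@profile for _ in 1:10^6
    mul!(x, y, z)
end
pprof()  # serves an interactive flame graph in the browser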

3 Likes

As explained by @danielwe, FixedSizeArray and Array both subtype DenseArray, so they should basically always take the same code paths for operations like that. The difference is that Array is slightly costlier to allocate and dereference, and has worse inferred effects, making it less likely that the compiler will find good optimizations.
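
The effects difference can be inspected directly; a sketch (Base.infer_effects is an internal API, and its output format varies across Julia versions):

using FixedSizeArrays

Base.infer_effects(getindex, (Vector{Float64}, Int))
Base.infer_effects(getindex, (typeof(FixedSizeArray(rand(3))), Int))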

1 Like

On my end there’s literally no daylight between these:

julia> using BenchmarkTools, LinearAlgebra, FixedSizeArrays

julia> @btime mul!(x, y, z) setup=(n = 4; x = FixedSizeArray(rand(Float32, n, n)); y = FixedSizeArray(rand(Float32, n, n)); z = FixedSizeArray(rand(Float32, n, n));) seconds=Inf samples=100000 evals=10;
  37.500 ns (0 allocations: 0 bytes)

julia> @btime mul!(x, y, z) setup=(n = 4; x = rand(Float32, n, n); y = rand(Float32, n, n); z = rand(Float32, n, n);) seconds=Inf samples=100000 evals=10;
  37.500 ns (0 allocations: 0 bytes)
4 Likes

My version info is:

Julia Version 1.13.0-DEV.721
Commit 7381e22c184 (2025-06-08 23:19 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-20.1.2 (ORCJIT, znver2)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)

Given that your machine seems faster, you may need to increase evals in the @btime line.

Here’s with evals = 256. There’s only a ~0.5 ns gap in Array’s favor, but it’s actually reproducible. (0.8 ns as shown here, but across multiple samples the shift between the distributions seems closer to 0.5-0.6 ns.)

julia> @btime mul!(x, y, z) setup=(n = 4; x = FixedSizeArray(rand(Float32, n, n)); y = FixedSizeArray(rand(Float32, n, n)); z = FixedSizeArray(rand(Float32, n, n));) seconds=Inf samples=100000 evals=256;
  41.828 ns (0 allocations: 0 bytes)

julia> @btime mul!(x, y, z) setup=(n = 4; x = rand(Float32, n, n); y = rand(Float32, n, n); z = rand(Float32, n, n);) seconds=Inf samples=100000 evals=256;
  41.016 ns (0 allocations: 0 bytes)

julia> versioninfo()  # with OpenBLAS; times are lower if I load AppleAccelerate.jl
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 14 × Apple M4 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 10 virtual cores)
1 Like

It’s interesting that with more evaluations you get a larger minimum time; if anything I’d expect it to be shorter :thinking:

Yeah, but with the perfectly round, perfectly reproducible 37.500 ns there’s clearly something fishy going on at low evals. For evals = 1 I get 0.001 ns, for evals = 2 it’s 20.500 ns, and it keeps growing until it saturates at the 41 ns or so shown above. My guess is that BenchmarkTools.jl is overcompensating for the setup code, so you need enough evals to make that inaccuracy negligible. Specifically, it seems I need evals of at least ~100 to get the correct value, i.e., each sample has to take around 4 μs or more (evals * btime).
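
One way to take the setup compensation out of the picture entirely is to build the arrays up front and interpolate them into the benchmark, e.g.:

using BenchmarkTools, LinearAlgebra, FixedSizeArrays

n = 4
x = FixedSizeArray(rand(Float32, n, n))
y = FixedSizeArray(rand(Float32, n, n))
z = FixedSizeArray(rand(Float32, n, n))
@btime mul!($x, $y, $z) samples=100000 evals=256;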

2 Likes

Can Arrays of FixedSizeArrays or nested FixedSizeArrays be stored as contiguous memory?

No. They still are mutable objects with identity.

4 Likes

Not an independent identity; they’re mutable in the way SubArrays are:

struct FixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    mem::Mem
    size::NTuple{N,Int}
    ...
end

"""
...
...default memory backend (`Vector{T}` on Julia v1.10, `Memory{T}` on Julia v1.11+).
"""

Oscar was talking about Memory, which is the default data container for a FixedSizeArray. Yes, the FixedSizeArray itself can be allocated inline, but that just means you store a pointer and an NTuple inline. The actual contents of the array are not stored inline, so there’s no contiguous-memory layout as with StaticArrays.
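
You can check both halves of that claim from the REPL; a sketch (Base.allocatedinline is internal, but it answers exactly this question):

using FixedSizeArrays

v = [FixedSizeArray(rand(3)) for _ in 1:4]
Base.allocatedinline(eltype(v))  # true: the wrappers (pointer + size) are stored inline
isbitstype(eltype(v))            # false: each wrapper still references its own Memory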

5 Likes

Can you elaborate on the reasons for your decision to do

struct FixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    mem::Mem
    size::NTuple{N,Int}
end

as opposed to

mutable struct AlternativeFixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    const mem::Mem
    const size::NTuple{N,Int}
end

Both inform the compiler that the fields are constant (which allows the all-important LICM, i.e. loop-invariant code motion). The main difference is that the wrapper AlternativeFixedSizeArray is always heap-allocated (and it has an identity), while FixedSizeArray tries to be stack/register-allocated.

Stack/register allocation is of course nice if you have many short-lived FixedSizeArrays that are backed by longer-living Memory instances. But I don’t see this happening in real life.

Since Memory is already heap-allocated, and since most wrappers will presumably have the same lifetime as the backing Memory instance, you don’t save a lot of allocations/memory. But you potentially lose a lot of performance: the wrapper may always need to be spilled to the stack, there are possible boxing/unboxing conversions, etc.

PS. The trade-offs are very different for e.g. array views! These are typically short-lived and reuse longer-lived backing/parent memory instances. So views must be struct rather than mutable struct.
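
For intuition, here’s a toy version of the two designs (not the real types) that you can poke at with @ballocated; the exact results depend on the compiler version:

using BenchmarkTools

struct ImmutableWrap{T}
    mem::Vector{T}
    size::NTuple{2,Int}
end

mutable struct MutableWrap{T}  # const fields need Julia 1.8+
    const mem::Vector{T}
    const size::NTuple{2,Int}
end

v = rand(16)
g(v) = (w = ImmutableWrap(v, (4, 4)); sum(w.mem) * w.size[1])
h(v) = (w = MutableWrap(v, (4, 4)); sum(w.mem) * w.size[1])

@ballocated g($v)  # 0: the immutable wrapper never hits the heap
@ballocated h($v)  # often also 0: escape analysis can elide the mutable wrapper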

That was the definition before the PR “make FixedSizeArray immutable by oscardssmith” (Pull Request #2 · JuliaArrays/FixedSizeArrays.jl · GitHub).

1 Like

Could you expand on why you’d want to use the const-mutable pattern here? The only reasons I can think of are ~1 ns faster hashing, and being able to attach finalizers to a specific FixedSizeArray rather than to its .mem field.

I’m not sure I see any advantage to that, especially relative to the GC overhead downside you’d have from more allocations.


Sidenote, but AlternativeFixedSizeArray is not always heap-allocated. If it doesn’t escape a function body (modulo inlining), the compiler is allowed to put mutable structs on the stack.

Julia has zero semantic guarantees about what is heap- versus stack-allocated.

4 Likes

I’m curious about a comparison between Memory{T} and FixedSizeVector{T, Memory{T}}. For example, one issue I spotted with Memory is that a .+ a yields a Vector when a::Memory, but maybe that’s only a transient problem (cf. also Inconsistent return types of `map` when applied to `Memory` · Issue #56962 · JuliaLang/julia · GitHub).
When would you recommend FixedSizeVector over Memory?
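
For reference, the Memory behaviour mentioned above looks like this (a sketch, as of Julia v1.11):

a = Memory{Float64}(undef, 2)
a .= 1.0
typeof(a .+ a)  # Vector{Float64}, not Memory{Float64}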