Relevant WIP PR, hoping something like this gets merged one day:
What if the multiplication is out of place? Will FixedSizeArrays.jl hit worse paths?
Technically, there are actually some cases where `FixedSizeArray` could be faster even for in-place BLAS operations, because when you do e.g. `mul!(C, A, B)`, it has to follow one extra degree of pointer indirection to actually get the pointers for `C`, `A`, and `B` if they are `Array`s, in order to pass those pointers to the BLAS library. That is, with `Array`, it has to look inside the `mutable struct Array` for the pointer to the `MemoryRef`, and then follow the `MemoryRef`'s pointer to the actual data. With `FixedSizeArray`, the reference to the `Memory` is stored inline, so it only needs to follow one pointer to get to the data.
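As a rough way to see the layout difference (an exploratory check, assuming Julia 1.11+ where `Array` is a wrapper around `Memory`; field names may vary across versions):

```julia
using FixedSizeArrays

fieldnames(Array)           # (:ref, :size) -- :ref is a MemoryRef, so
                            # reaching the data takes two pointer hops
fieldnames(FixedSizeArray)  # (:mem, :size) -- :mem references the Memory
                            # directly, so the data is one hop away
```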
In reality, though:
- This is an O(1) cost at worst, so it's only relevant for very small matrices that you're multiplying many times while the data isn't hot in the cache.
- This extra degree of pointer indirection is something that Julia is often quite good at eliding, so it's sometimes an O(0) cost.
- The current implementation of FixedSizeArrays.jl is actually slower at stuff like `mul!` than regular `Array`, probably because of some missing dispatches or something in the chain between `mul!` and an actual BLAS `ccall`. So the advantage is only theoretical, whereas in practice it actually has a disadvantage.
Yes, it's totally possible that in some cases FixedSizeArrays.jl is hitting a slow generic fallback method for `AbstractArray`, when pretty much the same method as for `Array` would work and would be equally efficient.
One case where this happened was for `rand!`: we were hitting the slow generic fallback method, and just copying the methods for `Array`s closed the performance gap.
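One way to spot such fallbacks (an exploratory check; which method shows up depends on the package and Julia versions):

```julia
using FixedSizeArrays, Random

A = FixedSizeArray(zeros(100))
@which rand!(A)  # a method signature on AbstractArray suggests the slow generic
                 # fallback; one on a dense/array type suggests the fast path
```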
At the moment we don't have any specialised methods for `LinearAlgebra` functions; if users find methods that could be improved, contributions are very much welcome!
LinearAlgebra.jl uses the `StridedArray` union for most of its dispatch to BLAS. As a subtype of `DenseArray`, `FixedSizeArray` is included in that union, so it shouldn't need a lot of dedicated methods (see the check below). Any MWEs for slow BLAS/LAPACK operations?
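A quick check of that union membership (type parameters written out explicitly; `Memory` requires Julia 1.11+):

```julia
using FixedSizeArrays

# FixedSizeArray <: DenseArray, and DenseArray is one branch of the
# StridedArray union used for BLAS dispatch:
FixedSizeArray{Float32, 2, Memory{Float32}} <: StridedMatrix{Float32}  # true
```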

Any MWEs for slow BLAS/LAPACK operations?
Here's my MWE; not sure if it's the same one that @Mason found:
```julia
julia> using BenchmarkTools, LinearAlgebra, FixedSizeArrays

julia> @btime mul!(x, y, z) setup=(n = 4; x = FixedSizeArray(rand(Float32, n, n)); y = FixedSizeArray(rand(Float32, n, n)); z = FixedSizeArray(rand(Float32, n, n));) seconds=Inf samples=100000 evals=10;
  147.200 ns (0 allocations: 0 bytes)

julia> @btime mul!(x, y, z) setup=(n = 4; x = rand(Float32, n, n); y = rand(Float32, n, n); z = rand(Float32, n, n);) seconds=Inf samples=100000 evals=10;
  134.300 ns (0 allocations: 0 bytes)
```
Profiling this example, as far as I can see the same code paths are taken, yet `Matrix` is faster than `FixedSizeMatrix`. With both `C=false` and `C=true` I see the same functions and line numbers in the flame graph. Not sure why there's a difference in performance. Perhaps taking a look at the profiles in PProf could reveal something; trying that next.
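For reference, a sketch of how such a profile can be collected for PProf (assumes PProf.jl is installed; `pprof()` opens its web UI):

```julia
using Profile, PProf, LinearAlgebra, FixedSizeArrays

n = 4
x = FixedSizeArray(rand(Float32, n, n))
y = FixedSizeArray(rand(Float32, n, n))
z = FixedSizeArray(rand(Float32, n, n))

Profile.clear()
@profile for _ in 1:10^6
    mul!(x, y, z)  # profile many small multiplications
end
pprof()  # browse the collected profile interactively
```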

What if the multiplication is out of place? Will FixedSizeArrays.jl hit worse paths?
As explained by @danielwe, `FixedSizeArray` and `Array` both subtype `DenseArray`, so they should basically always have the same code paths for operations like that. The difference is that `Array` is slightly costlier to allocate and dereference, and has worse effects, making it less likely the compiler will figure out good optimizations.
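The effects claim can be explored with `Base.infer_effects` (an internal, version-dependent API, so treat this as a sketch rather than a stable check):

```julia
using FixedSizeArrays

# Compare what the compiler can prove about the same operation on each type;
# stronger effects make hoisting and allocation elision more likely.
Base.infer_effects(size, (Vector{Float64},))
Base.infer_effects(size, (FixedSizeVector{Float64, Memory{Float64}},))
```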
On my end there’s literally no daylight between these:
```julia
julia> using BenchmarkTools, LinearAlgebra, FixedSizeArrays

julia> @btime mul!(x, y, z) setup=(n = 4; x = FixedSizeArray(rand(Float32, n, n)); y = FixedSizeArray(rand(Float32, n, n)); z = FixedSizeArray(rand(Float32, n, n));) seconds=Inf samples=100000 evals=10;
  37.500 ns (0 allocations: 0 bytes)

julia> @btime mul!(x, y, z) setup=(n = 4; x = rand(Float32, n, n); y = rand(Float32, n, n); z = rand(Float32, n, n);) seconds=Inf samples=100000 evals=10;
  37.500 ns (0 allocations: 0 bytes)
```
My version info is:
```
Julia Version 1.13.0-DEV.721
Commit 7381e22c184 (2025-06-08 23:19 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-20.1.2 (ORCJIT, znver2)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)
```
Given that your machine seems faster, you may need to increase `evals` in the `@btime` line.
Here's with `evals = 256`. There's only a ~0.5 ns gap in `Array`'s favor, but it's actually reproducible. (0.8 ns as shown here, but across multiple samples the shift between the distributions seems closer to 0.5-0.6 ns.)
```julia
julia> @btime mul!(x, y, z) setup=(n = 4; x = FixedSizeArray(rand(Float32, n, n)); y = FixedSizeArray(rand(Float32, n, n)); z = FixedSizeArray(rand(Float32, n, n));) seconds=Inf samples=100000 evals=256;
  41.828 ns (0 allocations: 0 bytes)

julia> @btime mul!(x, y, z) setup=(n = 4; x = rand(Float32, n, n); y = rand(Float32, n, n); z = rand(Float32, n, n);) seconds=Inf samples=100000 evals=256;
  41.016 ns (0 allocations: 0 bytes)

julia> versioninfo() # with OpenBLAS; times are lower if I load AppleAccelerate.jl
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 14 × Apple M4 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 10 virtual cores)
```
It's interesting that with more evaluations you get a larger minimum time; if anything, I'd expect it to be shorter.
Yeah, but with the perfectly round, perfectly reproducible 37.500 ns there's clearly something fishy going on at low `evals`. For `evals = 1` I get 0.001 ns, for `evals = 2` it's 20.500 ns, and it keeps growing until it saturates at the 41 or so ns shown above. My guess is BenchmarkTools.jl is overcompensating for the setup code, so you need enough evals to make that inaccuracy negligible. Specifically, it seems like I need an `evals` of at least ~100 to get the correct value, i.e., each sample has to take around 4 µs or more (`evals * btime`).
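A minimal sketch to reproduce that saturation curve, sweeping `evals` via `@benchmarkable`/`run` (numbers are machine-dependent):

```julia
using BenchmarkTools, LinearAlgebra

b = @benchmarkable mul!(x, y, z) setup=(n = 4; x = rand(Float32, n, n);
                                        y = rand(Float32, n, n);
                                        z = rand(Float32, n, n))

for evals in (1, 2, 8, 32, 128, 512)
    t = run(b; evals=evals, samples=10_000)
    println(evals, " evals => ", minimum(t).time, " ns")  # min time per call
end
```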
Can `Array`s of `FixedSizeArray`s or nested `FixedSizeArray`s be stored as contiguous memory?
No. They are still mutable objects with identity.
Not an independent identity; they're mutable the way `SubArray`s are:
```julia
struct FixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    mem::Mem
    size::NTuple{N,Int}
    ...

"""
...
...default memory backend (`Vector{T}` on Julia v1.10, `Memory{T}` on Julia v1.11+).
"""
```
Oscar was talking about `Memory`, which is the default data container for a `FixedSizeArray`. Yes, the `FixedSizeArray` itself can be allocated inline, but that just means you put a pointer and an `NTuple` inline. The actual contents of the array are not stored inline, so no contiguous-memory stuff like with StaticArrays.
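The distinction is easy to see with `isbitstype` (assuming StaticArrays.jl is available for the comparison):

```julia
using FixedSizeArrays, StaticArrays

# isbits types are stored inline inside an Array; reference-carrying types are not.
isbitstype(SVector{3, Float64})                        # true: elements stored inline
isbitstype(FixedSizeVector{Float64, Memory{Float64}})  # false: carries a Memory reference
```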
Can you elaborate on the reasons for your decision to do
```julia
struct FixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    mem::Mem
    size::NTuple{N,Int}
end
```
as opposed to
```julia
mutable struct AlternativeFixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    const mem::Mem
    const size::NTuple{N,Int}
end
```
Both inform the compiler that the fields are constant (which allows the all-important LICM, i.e. loop-invariant code motion). The main difference is that the wrapper `AlternativeFixedSizeArray` is always heap-allocated (and it has an identity), while `FixedSizeArray` tries to be stack/register allocated.
Stack/register allocation is of course nice if you have many short-lived FixedSizeArrays that are backed by longer-living Memory instances. But I don’t see this happening in real life.
Since Memory is already heap-allocated, and since most uses will presumably have the same lifetime as the backing Memory instance, you don’t save a lot of allocs/memory. But you potentially lose a lot of perf – stuff needs to always be spilled to the stack, possible boxing-unboxing conversions, etc.
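To make the tradeoff concrete, here's a toy version of both designs (hypothetical wrapper types, not the package's definitions; `Memory` requires Julia 1.11+):

```julia
struct ImmutableWrap
    mem::Memory{Float64}
    len::Int
end

mutable struct MutableWrap
    const mem::Memory{Float64}
    const len::Int
end

# Rewrapping the same backing Memory is essentially free for the immutable
# version (the wrapper can live in registers), whereas each MutableWrap is a
# heap allocation unless escape analysis proves it never escapes.
rewrap_im(m::Memory{Float64}) = ImmutableWrap(m, length(m))
rewrap_mu(m::Memory{Float64}) = MutableWrap(m, length(m))
```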
PS: The trade-offs are very different for e.g. array views! These are typically short-lived and re-use longer-lived backing/parent memory instances. So views must be `struct` instead of `mutable struct`.

Can you elaborate on the reasons for your decision to do
```julia
struct FixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    mem::Mem
    size::NTuple{N,Int}
end
```
as opposed to
```julia
mutable struct AlternativeFixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    const mem::Mem
    const size::NTuple{N,Int}
end
```
That was the definition before the PR "make FixedSizeArray immutable" by oscardssmith (JuliaArrays/FixedSizeArrays.jl#2).

Can you elaborate on the reasons for your decision to do
```julia
struct FixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    mem::Mem
    size::NTuple{N,Int}
end
```
as opposed to
```julia
mutable struct AlternativeFixedSizeArray{T,N,Mem<:DenseVector{T}} <: DenseArray{T,N}
    const mem::Mem
    const size::NTuple{N,Int}
end
```
Both inform the compiler that the fields are constant (which allows the all-important LICM). The main difference is that the wrapper `AlternativeFixedSizeArray` is always heap-allocated (and it has an identity), while `FixedSizeArray` tries to be stack/register allocated.
Could you expand on why you'd want to use the `const`-`mutable` pattern here? The only reason I can think of for wanting to do this would be ~1 ns faster hashing, and being able to attach finalizers to a specific `FixedSizeArray` rather than its `.mem` field.
I’m not sure I see any advantage to that, especially relative to the GC overhead downside you’d have from more allocations.
Sidenote, but `AlternativeFixedSizeArray` is not always heap-allocated. If it doesn't escape a function body modulo inlining, we're allowed to put mutable structs on the stack. Julia has zero semantic guarantees for what stuff is heap- versus stack-allocated.
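A minimal demonstration of that (whether the allocation is actually elided depends on the Julia version and optimizer, so this is a sketch):

```julia
mutable struct Counter
    x::Int
end

function sum_to(n)
    c = Counter(0)  # never escapes sum_to, so the allocation can be elided
    for i in 1:n
        c.x += i
    end
    return c.x
end

sum_to(1000)             # warm up, so compilation doesn't count
@allocated sum_to(1000)  # typically 0 bytes: the mutable struct never hit the heap
```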
I'm curious about a comparison between `Memory{T}` and `FixedSizeVector{T, Memory{T}}`. For example, one issue I spotted with `Memory` is that `a .+ a` yields a `Vector` when `a::Memory`, but maybe that's only a transient problem (cf. also Inconsistent return types of `map` when applied to `Memory` · Issue #56962 · JuliaLang/julia · GitHub). When would you recommend `FixedSizeVector` over `Memory`?