# Performance regression with StaticArrays?

Why do the `@SVector` and `@MVector` macros both make performance worse in this example?

Code (you can copy and paste it into the REPL):

``````
using BenchmarkTools
using StaticArrays
using Random

# g1: static levels, mutable static result
function g1(T)
    levels = @SVector T[.7, .8, .9, 1.]
    stacks = @MVector zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

# g2: static levels, plain Array result
function g2(T)
    levels = @SVector T[.7, .8, .9, 1.]
    stacks = zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

# g3: plain Array levels, mutable static result
function g3(T)
    levels = T[.7, .8, .9, 1.]
    stacks = @MVector zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

# g4: plain Arrays only
function g4(T)
    levels = T[.7, .8, .9, 1.]
    stacks = zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

# g5: levels as a range, plain Array result
function g5(T)
    levels = range(T(.7), T(1), 4)
    stacks = zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

T = Float32
@benchmark g1(T)
@benchmark g2(T)
@benchmark g3(T)
@benchmark g4(T)
@benchmark g5(T)
``````

Results:

``````
julia> @benchmark g1(T)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.378 μs … 359.733 μs  ┊ GC (min … max): 0.00% … 98.46%
 Time  (median):     2.444 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.652 μs ±   3.597 μs  ┊ GC (mean ± σ):  1.34% ±  0.98%

 2.38 μs      Histogram: log(frequency) by time      4.51 μs <

 Memory estimate: 1.12 KiB, allocs estimate: 24.

julia> @benchmark g2(T)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.500 μs … 330.450 μs  ┊ GC (min … max): 0.00% … 99.33%
 Time  (median):     1.550 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.712 μs ±   3.392 μs  ┊ GC (mean ± σ):  1.92% ±  0.99%

 1.5 μs       Histogram: log(frequency) by time      2.97 μs <

 Memory estimate: 688 bytes, allocs estimate: 15.

julia> @benchmark g3(T)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.490 μs … 347.160 μs  ┊ GC (min … max): 0.00% … 99.19%
 Time  (median):     1.540 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.723 μs ±   3.479 μs  ┊ GC (mean ± σ):  2.00% ±  0.99%

 1.49 μs      Histogram: log(frequency) by time      3.42 μs <

 Memory estimate: 688 bytes, allocs estimate: 15.

julia> @benchmark g4(T)
BenchmarkTools.Trial: 10000 samples with 184 evaluations.
 Range (min … max):  555.978 ns …  10.843 μs  ┊ GC (min … max): 0.00% … 94.30%
 Time  (median):     569.565 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   596.218 ns ± 313.435 ns  ┊ GC (mean ± σ):  1.51% ±  2.83%

 556 ns        Histogram: log(frequency) by time        898 ns <

 Memory estimate: 224 bytes, allocs estimate: 6.

julia> @benchmark g5(T)
BenchmarkTools.Trial: 10000 samples with 278 evaluations.
 Range (min … max):  284.532 ns …   8.292 μs  ┊ GC (min … max): 0.00% … 89.91%
 Time  (median):     287.770 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   306.729 ns ± 142.426 ns  ┊ GC (mean ± σ):  0.83% ±  1.87%

 285 ns        Histogram: log(frequency) by time        578 ns <

 Memory estimate: 80 bytes, allocs estimate: 1.
``````

As can be seen, `@SVector` and `@MVector` each add around 1 μs of run time and 9 allocations, while plain `Array` is fast and allocates little, contrary to usual experience. `g5` isn't a fair comparison, as it exploits a pattern in the input, but it shows how much `g4` could still improve.

`versioninfo()`:

``````
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia
``````
Interesting, I can reproduce the slowness. I'm not sure what causes it, but you can get it a bit faster for this particular example by using tuples directly:

``````
# `rng` is assumed to be defined beforehand (its construction is not shown here)
function g1(rng, ::Type{T}) where T
    levels = (T(.7), T(.8), T(.9), T(1.))
    # draw four random levels directly into a tuple, then wrap it
    nt = ntuple(i->rand(rng, levels), 4)
    SVector(nt)
end

@benchmark g1(rng, $T)

julia> @benchmark g1(rng, T) # T = Float32
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  30.248 ns …  10.829 μs  ┊ GC (min … max): 0.00% … 99.47%
 Time  (median):     31.972 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.578 ns ± 182.827 ns  ┊ GC (mean ± σ):  8.41% ±  1.72%

 30.2 ns       Histogram: log(frequency) by time      51.2 ns <

 Memory estimate: 48 bytes, allocs estimate: 1.
``````

I get quite a significant slowdown with `Float32` compared to `Float64`:

``````
julia> @benchmark g1($rng, Float64)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.310 ns … 18.332 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.481 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.569 ns ±  0.296 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 7.31 ns      Histogram: log(frequency) by time     8.62 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
``````

The problem is just that you're benchmarking with `T` as a non-constant global variable.

``````
julia> let T = Float32
           @btime g1($T) # <----- Note the $ to interpolate into the benchmark
           @btime g2($T)
           @btime g3($T)
           @btime g4($T)
           @btime g5($T)
       end
  72.298 ns (1 allocation: 32 bytes)
  1.100 μs (15 allocations: 624 bytes)
  2.617 μs (26 allocations: 1.55 KiB)
  378.493 ns (6 allocations: 224 bytes)
  163.829 ns (1 allocation: 80 bytes)
``````

There's a further problem here though, since `g1` is significantly slower than it should be when called this way. Compare to:

``````
julia> @btime g1(Float32)
  13.337 ns (1 allocation: 32 bytes)
``````

``````
julia> @btime g1($Float32);
  113.066 ns (1 allocation: 32 bytes)

julia> @btime g1(Float32);
  19.655 ns (1 allocation: 32 bytes)
``````

I think not interpolating the type permits some constant propagation, which may not be what we want while benchmarking.

Guys, I found the reason myself.

First, as @Mason points out, declaring `const T = Float32` improves performance, as it avoids a non-constant global and the resulting type instability. This accounts for a ~90 ns delay in all variants, but cannot explain the differences among them.
Using `@btime g1($T)` instead of `@btime g1(T)` has a similar effect to a non-const `T`, also resulting in a ~90 ns delay. This delay does not stack with the previous one, so it is probably due to dynamic dispatch.
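To make the three ways of passing the type concrete, here is a minimal sketch. `pick_levels`, `T_global` and `T_CONST` are hypothetical names I made up for illustration (a stand-in for `g4`, not code from the posts above), and I deliberately give no timings since they will differ per machine:

```julia
using BenchmarkTools

# Hypothetical stand-in for g4 from the question: plain Arrays only.
function pick_levels(::Type{T}) where T
    levels = T[0.7, 0.8, 0.9, 1.0]
    stacks = zeros(T, 4)
    for i in 1:4
        stacks[i] = rand(levels)
    end
    return stacks
end

T_global = Float32        # non-const global: its type can change at any time
const T_CONST = Float32   # const global: the compiler can rely on its value

@btime pick_levels(T_global)    # non-const global looked up inside the benchmark
@btime pick_levels($T_global)   # value interpolated into the benchmark
@btime pick_levels(T_CONST)     # const global, no interpolation needed
```

Comparing the three lines on your own machine should show whether the extra delay comes from the global lookup or from how the type reaches the function.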

Second, Julia does not specialize on the type argument `T`, as per "Be aware of when Julia avoids specializing" in the performance tips. Changing the signature from `function g1(T)` to `function g1(::Type{T}) where T` changes the game: `g1` now takes ~30 ns on my computer, `g2` and `g3` ~60 ns, and `g4` ~90 ns.
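For reference, this is roughly what the changed signature looks like applied to `g1`. It is a sketch only: the name `g1_specialized` is mine, chosen so it doesn't overwrite the original `g1`, and the body is otherwise unchanged from the question:

```julia
using StaticArrays

# `::Type{T}) where T` turns `T` into a static parameter of the method,
# so Julia compiles a separate specialization for each type passed in.
function g1_specialized(::Type{T}) where T
    levels = @SVector T[0.7, 0.8, 0.9, 1.0]
    stacks = @MVector zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    return stacks
end

stacks = g1_specialized(Float32)
```

The call site stays the same (`g1_specialized(Float32)`), only the method signature differs.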

As an aside, I suggest returning `SVector(stacks)` whenever `stacks` is an `MVector`.
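A sketch of that suggestion, again under a hypothetical name (`g1_immutable`). The `MVector` is used only as a local scratch buffer and the caller gets an immutable `SVector`; whether this actually removes the remaining allocation is something I'd check with `@btime` rather than assume:

```julia
using StaticArrays

function g1_immutable(::Type{T}) where T
    levels = @SVector T[0.7, 0.8, 0.9, 1.0]
    stacks = @MVector zeros(T, 4)   # mutable scratch buffer, local to the function
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    return SVector(stacks)          # hand back an immutable copy
end

result = g1_immutable(Float32)
```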

