Performance regression with StaticArrays?

Why do the @SVector and @MVector macros both make performance worse in this example?

Code (you can copy and paste it into the REPL):

using BenchmarkTools
using StaticArrays
using Random

function g1(T)
    levels = @SVector T[.7, .8, .9, 1.]
    stacks = @MVector zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

function g2(T)
    levels = @SVector T[.7, .8, .9, 1.]
    stacks = zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

function g3(T)
    levels = T[.7, .8, .9, 1.]
    stacks = @MVector zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

function g4(T)
    levels = T[.7, .8, .9, 1.]
    stacks = zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

function g5(T)
    levels = range(T(.7), T(1), 4)
    stacks = zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

T = Float32
@benchmark g1(T)
@benchmark g2(T)
@benchmark g3(T)
@benchmark g4(T)
@benchmark g5(T)

Results:

julia> @benchmark g1(T)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.378 ΞΌs … 359.733 ΞΌs  β”Š GC (min … max): 0.00% … 98.46%
 Time  (median):     2.444 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.652 ΞΌs Β±   3.597 ΞΌs  β”Š GC (mean Β± Οƒ):  1.34% Β±  0.98%

  β–ˆβ–ˆβ–†β–„β–„β–„β–‚β–‚β–…β–…β–„β–ƒβ–ƒβ–„β–ƒβ–‚β–ƒβ–ƒβ–‚β–β–β–‚β–                                     β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–ˆβ–‡β–‡β–…β–…β–ƒβ–‡β–…β–†β–…β–…β–…β–…β–…β–…β–„β–…β–…β–…β–ƒβ–„β–ƒβ–…β–„β–„β–β–ƒβ–…β–…β–… β–ˆ
  2.38 ΞΌs      Histogram: log(frequency) by time      4.51 ΞΌs <

 Memory estimate: 1.12 KiB, allocs estimate: 24.

julia> @benchmark g2(T)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.500 ΞΌs … 330.450 ΞΌs  β”Š GC (min … max): 0.00% … 99.33%
 Time  (median):     1.550 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.712 ΞΌs Β±   3.392 ΞΌs  β”Š GC (mean Β± Οƒ):  1.92% Β±  0.99%

  β–‡β–ˆβ–†β–‚β–β–β–†β–„β–‚ ▃▄▅▄▂▁▂▂▂                                         β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–†β–†β–†β–†β–…β–„β–„β–…β–…β–†β–„β–‚β–„β–„β–„β–„β–‚β–„β–„β–„β–ƒβ–„β–ƒβ–„β–‚β–„β–‚β–„β–„β–…β–„β–†β–„β–… β–ˆ
  1.5 ΞΌs       Histogram: log(frequency) by time      2.97 ΞΌs <

 Memory estimate: 688 bytes, allocs estimate: 15.

julia> @benchmark g3(T)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.490 ΞΌs … 347.160 ΞΌs  β”Š GC (min … max): 0.00% … 99.19%
 Time  (median):     1.540 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.723 ΞΌs Β±   3.479 ΞΌs  β”Š GC (mean Β± Οƒ):  2.00% Β±  0.99%

  β–ˆβ–‡β–ƒβ–‚β–†β–„β–β–„β–„β–„β–ƒβ–‚β–‚β–‚                                     ▂▁       β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–†β–†β–†β–†β–…β–„β–…β–„β–„β–ƒβ–„β–„β–„β–„β–β–„β–β–„β–ƒβ–„β–…β–„β–ƒβ–„β–ƒβ–„β–„β–β–„β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–… β–ˆ
  1.49 ΞΌs      Histogram: log(frequency) by time      3.42 ΞΌs <

 Memory estimate: 688 bytes, allocs estimate: 15.

julia> @benchmark g4(T)
BenchmarkTools.Trial: 10000 samples with 184 evaluations.
 Range (min … max):  555.978 ns …  10.843 ΞΌs  β”Š GC (min … max): 0.00% … 94.30%
 Time  (median):     569.565 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   596.218 ns Β± 313.435 ns  β”Š GC (mean Β± Οƒ):  1.51% Β±  2.83%

  β–„β–ˆβ–†β–‡β–†β–„β–ƒβ–β–β–     ▁▂▂▂▂                                          β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–‡β–†β–†β–‡β–†β–†β–…β–…β–„β–ƒβ–…β–†β–…β–…β–ƒβ–…β–ƒβ–„β–„β–„β–„β–„β–ƒβ–…β–„β–β–†β–‡β–†β–„β–…β–β–„β–ƒβ–„β–ƒβ–„β–… β–ˆ
  556 ns        Histogram: log(frequency) by time        898 ns <

 Memory estimate: 224 bytes, allocs estimate: 6.

julia> @benchmark g5(T)
BenchmarkTools.Trial: 10000 samples with 278 evaluations.
 Range (min … max):  284.532 ns …   8.292 ΞΌs  β”Š GC (min … max): 0.00% … 89.91%
 Time  (median):     287.770 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   306.729 ns Β± 142.426 ns  β”Š GC (mean Β± Οƒ):  0.83% Β±  1.87%

  β–ˆβ–†β–…β–‚ ▃▃▃▃▂▁                                                   ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–…β–ˆβ–ˆβ–†β–†β–„β–ƒβ–„β–ƒβ–„β–ƒβ–„β–ƒβ–‚β–„β–ƒβ–„β–ƒβ–ƒβ–ƒβ–„β–ƒβ–…β–ƒβ–ƒβ–…β–…β–„β–…β–…β–…β–…β–…β–„β–ƒβ–†β–„β–ƒβ–…β–…β–ƒβ–„β–„β–„ β–ˆ
  285 ns        Histogram: log(frequency) by time        578 ns <

 Memory estimate: 80 bytes, allocs estimate: 1.

As can be seen, @SVector and @MVector each add roughly 1 ΞΌs of run time and 9 allocations, while the plain Array version is fast and allocates little, contrary to usual experience. g5 isn’t a fair comparison since it exploits a pattern in the input, but it shows how much further g4 could still improve.

versioninfo()
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 Γ— Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia
Project.toml
[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
Manifest.toml
# This file is machine-generated - editing it directly is not advised

julia_version = "1.8.5"
manifest_format = "2.0"
project_hash = "af012333bc8fedec0067ac10f97df20a71800fd7"

[[deps.Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"

[[deps.BenchmarkTools]]
deps = ["JSON", "Logging", "Printf", "Profile", "Statistics", "UUIDs"]
git-tree-sha1 = "d9a9701b899b30332bbcb3e1679c41cce81fb0e8"
uuid = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
version = "1.3.2"

[[deps.CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"
version = "1.0.1+0"

[[deps.Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[deps.JSON]]
deps = ["Dates", "Mmap", "Parsers", "Unicode"]
git-tree-sha1 = "3c837543ddb02250ef42f4738347454f95079d4e"
uuid = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
version = "0.21.3"

[[deps.Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[deps.LinearAlgebra]]
deps = ["Libdl", "libblastrampoline_jll"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[deps.Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[deps.Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[deps.OpenBLAS_jll]]
deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"]
uuid = "4536629a-c528-5b80-bd46-f80d51c5b363"
version = "0.3.20+0"

[[deps.Parsers]]
deps = ["Dates", "SnoopPrecompile"]
git-tree-sha1 = "8175fc2b118a3755113c8e68084dc1a9e63c61ee"
uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
version = "2.5.3"

[[deps.Preferences]]
deps = ["TOML"]
git-tree-sha1 = "47e5f437cc0e7ef2ce8406ce1e7e24d44915f88d"
uuid = "21216c6a-2e73-6563-6e65-726566657250"
version = "1.3.0"

[[deps.Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[deps.Profile]]
deps = ["Printf"]
uuid = "9abbd945-dff8-562f-b5e8-e1ebf5ef1b79"

[[deps.Random]]
deps = ["SHA", "Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[deps.SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
version = "0.7.0"

[[deps.Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[deps.SnoopPrecompile]]
deps = ["Preferences"]
git-tree-sha1 = "e760a70afdcd461cf01a575947738d359234665c"
uuid = "66db9d55-30c0-4569-8b51-7e840670fc0c"
version = "1.0.3"

[[deps.SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[deps.StaticArrays]]
deps = ["LinearAlgebra", "Random", "StaticArraysCore", "Statistics"]
git-tree-sha1 = "6954a456979f23d05085727adb17c4551c19ecd1"
uuid = "90137ffa-7385-5640-81b9-e52037218182"
version = "1.5.12"

[[deps.StaticArraysCore]]
git-tree-sha1 = "6b7ba252635a5eff6a0b0664a41ee140a1c9e72a"
uuid = "1e83bf80-4336-4d27-bf5d-d5a4f845583c"
version = "1.4.0"

[[deps.Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[deps.TOML]]
deps = ["Dates"]
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.0"

[[deps.UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[deps.Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[deps.libblastrampoline_jll]]
deps = ["Artifacts", "Libdl", "OpenBLAS_jll"]
uuid = "8e850b90-86db-534c-a0d3-1478176c7d93"
version = "5.1.1+0"

Interesting, I can reproduce the slowness. I’m not sure what causes it, but you can make this particular example a bit faster by using tuples directly:

function g1(rng, ::Type{T}) where T
    levels = (T(.7), T(.8), T(.9), T(1.))
    nt = ntuple(i->rand(rng, levels), 4)
    SVector(nt)
end

rng = Random.default_rng()  # assumed: the reply does not show how rng was constructed
@benchmark g1($rng, $T)

julia> @benchmark g1(rng, T) # T = Float32
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  30.248 ns …  10.829 ΞΌs  β”Š GC (min … max): 0.00% … 99.47%
 Time  (median):     31.972 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   37.578 ns Β± 182.827 ns  β”Š GC (mean Β± Οƒ):  8.41% Β±  1.72%

   β–β–…β–‡β–ˆβ–‡β–„β–‚              ▁▂▄▄▅▄▄▃▂▂▁▁▁                          β–‚
  β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡β–†β–†β–†β–†β–†β–β–…β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–‡β–‡β–ˆβ–ˆβ–ˆβ–†β–†β–†β–…β–†β–…β–…β–†β–†β–…β–†β–†β–…β–†β–† β–ˆ
  30.2 ns       Histogram: log(frequency) by time      51.2 ns <

 Memory estimate: 48 bytes, allocs estimate: 1.

I get quite a significant slowdown using Float32 compared to Float64:

julia> @benchmark g1($rng, Float64)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.310 ns … 18.332 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     7.481 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   7.569 ns Β±  0.296 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ƒ    β–…β–ƒβ–ˆβ–†β–…β–ƒβ–…                  ▁ ▃▃▁          β–‚           β–‚ β–‚
  β–ˆβ–β–β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ƒβ–β–β–β–β–ƒβ–β–β–β–β–β–β–…β–†β–β–β–β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–‡β–β–β–ƒβ–β–„β–…β–†β–ˆβ–‡β–ˆβ–„β–‡β–ƒβ–ƒβ–ƒβ–ƒβ–β–‡β–†β–ˆ β–ˆ
  7.31 ns      Histogram: log(frequency) by time     8.62 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The problem is just that you’re benchmarking with T as a non-constant global variable.

julia> let T = Float32
       @btime g1($T) # <----- Note the $ to interpolate into the benchmark
       @btime g2($T)
       @btime g3($T)
       @btime g4($T)
       @btime g5($T)
       end
  72.298 ns (1 allocation: 32 bytes)
  1.100 ΞΌs (15 allocations: 624 bytes)
  2.617 ΞΌs (26 allocations: 1.55 KiB)
  378.493 ns (6 allocations: 224 bytes)
  163.829 ns (1 allocation: 80 bytes)

There’s a further problem here though, since g1 is significantly slower than it should be when called this way. Compare to:

julia> @btime g1(Float32)
  13.337 ns (1 allocation: 32 bytes)
julia> @btime g1($Float32);
  113.066 ns (1 allocation: 32 bytes)

julia> @btime g1(Float32);
  19.655 ns (1 allocation: 32 bytes)

I think not interpolating the type permits some constant propagation, which may not be what we want while benchmarking.

Guys, I found the reason myself.

First, as @Mason points out, declaring const T = Float32 improves performance because it avoids a non-constant global and the type instability it causes. This accounts for a ~90 ns delay in all variants, but cannot explain the difference among them.
Using @btime g1($T) instead of @btime g1(T) introduces a similar effect to a non-const T, again adding ~90 ns of delay. The two delays do not stack, so both are probably due to dynamic dispatch.
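
For concreteness, here is a minimal sketch of the two call styles being compared (it assumes the g1 from the original post is already defined; the const name Tc is mine):

using BenchmarkTools

const Tc = Float32     # const global: the type is known at the call site

@btime g1(Tc)          # no interpolation: the compiler can constant-propagate Tc
@btime g1($Tc)         # interpolation: Tc is passed to the benchmark kernel as a runtime value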

Second, Julia does not specialize on the type argument T with this signature, as per the performance-tips section β€œBe aware of when Julia avoids specializing”. Changing the signature from function g1(T) to function g1(::Type{T}) where T changes the game: g1 now takes ~30 ns on my machine, g2 and g3 ~60 ns, and g4 ~90 ns.
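
To make the fix concrete, here is a sketch of the specialized signature (the name g1_spec is mine; the body is unchanged from the original g1):

using BenchmarkTools
using StaticArrays

# ::Type{T} where T forces specialization on the type argument; plain g1(T) does not.
function g1_spec(::Type{T}) where T
    levels = @SVector T[.7, .8, .9, 1.]
    stacks = @MVector zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

@btime g1_spec(Float32)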

As an aside, I suggest returning SVector(stacks) whenever stacks is an MVector.
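
A tiny sketch of what that looks like (the variable names are mine):

using StaticArrays

m = @MVector zeros(Float32, 4)
s = SVector(m)   # immutable copy of the MVector; SVector{4, Float32} is isbits and stack-friendly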
