Performance regression with StaticArrays?

Why do the @SVector and @MVector macros both make performance worse in this example?

Code (you can copy and paste it into the REPL):

using BenchmarkTools
using StaticArrays
using Random

function g1(T)
    levels = @SVector T[.7, .8, .9, 1.]
    stacks = @MVector zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

function g2(T)
    levels = @SVector T[.7, .8, .9, 1.]
    stacks = zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

function g3(T)
    levels = T[.7, .8, .9, 1.]
    stacks = @MVector zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

function g4(T)
    levels = T[.7, .8, .9, 1.]
    stacks = zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

function g5(T)
    levels = range(T(.7), T(1), 4)
    stacks = zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

T = Float32
@benchmark g1(T)
@benchmark g2(T)
@benchmark g3(T)
@benchmark g4(T)
@benchmark g5(T)

Results:

julia> @benchmark g1(T)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.378 ΞΌs … 359.733 ΞΌs  β”Š GC (min … max): 0.00% … 98.46%
 Time  (median):     2.444 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.652 ΞΌs Β±   3.597 ΞΌs  β”Š GC (mean Β± Οƒ):  1.34% Β±  0.98%

  β–ˆβ–ˆβ–†β–„β–„β–„β–‚β–‚β–…β–…β–„β–ƒβ–ƒβ–„β–ƒβ–‚β–ƒβ–ƒβ–‚β–β–β–‚β–                                     β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–ˆβ–‡β–‡β–…β–…β–ƒβ–‡β–…β–†β–…β–…β–…β–…β–…β–…β–„β–…β–…β–…β–ƒβ–„β–ƒβ–…β–„β–„β–β–ƒβ–…β–…β–… β–ˆ
  2.38 ΞΌs      Histogram: log(frequency) by time      4.51 ΞΌs <

 Memory estimate: 1.12 KiB, allocs estimate: 24.

julia> @benchmark g2(T)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.500 ΞΌs … 330.450 ΞΌs  β”Š GC (min … max): 0.00% … 99.33%
 Time  (median):     1.550 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.712 ΞΌs Β±   3.392 ΞΌs  β”Š GC (mean Β± Οƒ):  1.92% Β±  0.99%

  β–‡β–ˆβ–†β–‚β–β–β–†β–„β–‚ ▃▄▅▄▂▁▂▂▂                                         β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–†β–†β–†β–†β–…β–„β–„β–…β–…β–†β–„β–‚β–„β–„β–„β–„β–‚β–„β–„β–„β–ƒβ–„β–ƒβ–„β–‚β–„β–‚β–„β–„β–…β–„β–†β–„β–… β–ˆ
  1.5 ΞΌs       Histogram: log(frequency) by time      2.97 ΞΌs <

 Memory estimate: 688 bytes, allocs estimate: 15.

julia> @benchmark g3(T)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.490 ΞΌs … 347.160 ΞΌs  β”Š GC (min … max): 0.00% … 99.19%
 Time  (median):     1.540 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.723 ΞΌs Β±   3.479 ΞΌs  β”Š GC (mean Β± Οƒ):  2.00% Β±  0.99%

  β–ˆβ–‡β–ƒβ–‚β–†β–„β–β–„β–„β–„β–ƒβ–‚β–‚β–‚                                     ▂▁       β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–†β–†β–†β–†β–…β–„β–…β–„β–„β–ƒβ–„β–„β–„β–„β–β–„β–β–„β–ƒβ–„β–…β–„β–ƒβ–„β–ƒβ–„β–„β–β–„β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–… β–ˆ
  1.49 ΞΌs      Histogram: log(frequency) by time      3.42 ΞΌs <

 Memory estimate: 688 bytes, allocs estimate: 15.

julia> @benchmark g4(T)
BenchmarkTools.Trial: 10000 samples with 184 evaluations.
 Range (min … max):  555.978 ns …  10.843 ΞΌs  β”Š GC (min … max): 0.00% … 94.30%
 Time  (median):     569.565 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   596.218 ns Β± 313.435 ns  β”Š GC (mean Β± Οƒ):  1.51% Β±  2.83%

  β–„β–ˆβ–†β–‡β–†β–„β–ƒβ–β–β–     ▁▂▂▂▂                                          β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–‡β–†β–†β–‡β–†β–†β–…β–…β–„β–ƒβ–…β–†β–…β–…β–ƒβ–…β–ƒβ–„β–„β–„β–„β–„β–ƒβ–…β–„β–β–†β–‡β–†β–„β–…β–β–„β–ƒβ–„β–ƒβ–„β–… β–ˆ
  556 ns        Histogram: log(frequency) by time        898 ns <

 Memory estimate: 224 bytes, allocs estimate: 6.

julia> @benchmark g5(T)
BenchmarkTools.Trial: 10000 samples with 278 evaluations.
 Range (min … max):  284.532 ns …   8.292 ΞΌs  β”Š GC (min … max): 0.00% … 89.91%
 Time  (median):     287.770 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   306.729 ns Β± 142.426 ns  β”Š GC (mean Β± Οƒ):  0.83% Β±  1.87%

  β–ˆβ–†β–…β–‚ ▃▃▃▃▂▁                                                   ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–…β–ˆβ–ˆβ–†β–†β–„β–ƒβ–„β–ƒβ–„β–ƒβ–„β–ƒβ–‚β–„β–ƒβ–„β–ƒβ–ƒβ–ƒβ–„β–ƒβ–…β–ƒβ–ƒβ–…β–…β–„β–…β–…β–…β–…β–…β–„β–ƒβ–†β–„β–ƒβ–…β–…β–ƒβ–„β–„β–„ β–ˆ
  285 ns        Histogram: log(frequency) by time        578 ns <

 Memory estimate: 80 bytes, allocs estimate: 1.

As can be seen, @SVector and @MVector each add roughly 1 ΞΌs of run time and 9 allocations, while the plain Array version is fast and allocates little, contrary to usual experience. g5 isn’t a fair comparison since it exploits a pattern in the input, but it shows how much further g4 could still improve.

versioninfo()
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 Γ— Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia
Project.toml
[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
Manifest.toml
# This file is machine-generated - editing it directly is not advised

julia_version = "1.8.5"
manifest_format = "2.0"
project_hash = "af012333bc8fedec0067ac10f97df20a71800fd7"

[[deps.Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"

[[deps.BenchmarkTools]]
deps = ["JSON", "Logging", "Printf", "Profile", "Statistics", "UUIDs"]
git-tree-sha1 = "d9a9701b899b30332bbcb3e1679c41cce81fb0e8"
uuid = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
version = "1.3.2"

[[deps.CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"
version = "1.0.1+0"

[[deps.Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[deps.JSON]]
deps = ["Dates", "Mmap", "Parsers", "Unicode"]
git-tree-sha1 = "3c837543ddb02250ef42f4738347454f95079d4e"
uuid = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
version = "0.21.3"

[[deps.Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[deps.LinearAlgebra]]
deps = ["Libdl", "libblastrampoline_jll"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[deps.Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[deps.Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[deps.OpenBLAS_jll]]
deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"]
uuid = "4536629a-c528-5b80-bd46-f80d51c5b363"
version = "0.3.20+0"

[[deps.Parsers]]
deps = ["Dates", "SnoopPrecompile"]
git-tree-sha1 = "8175fc2b118a3755113c8e68084dc1a9e63c61ee"
uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
version = "2.5.3"

[[deps.Preferences]]
deps = ["TOML"]
git-tree-sha1 = "47e5f437cc0e7ef2ce8406ce1e7e24d44915f88d"
uuid = "21216c6a-2e73-6563-6e65-726566657250"
version = "1.3.0"

[[deps.Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[deps.Profile]]
deps = ["Printf"]
uuid = "9abbd945-dff8-562f-b5e8-e1ebf5ef1b79"

[[deps.Random]]
deps = ["SHA", "Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[deps.SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
version = "0.7.0"

[[deps.Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[deps.SnoopPrecompile]]
deps = ["Preferences"]
git-tree-sha1 = "e760a70afdcd461cf01a575947738d359234665c"
uuid = "66db9d55-30c0-4569-8b51-7e840670fc0c"
version = "1.0.3"

[[deps.SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[deps.StaticArrays]]
deps = ["LinearAlgebra", "Random", "StaticArraysCore", "Statistics"]
git-tree-sha1 = "6954a456979f23d05085727adb17c4551c19ecd1"
uuid = "90137ffa-7385-5640-81b9-e52037218182"
version = "1.5.12"

[[deps.StaticArraysCore]]
git-tree-sha1 = "6b7ba252635a5eff6a0b0664a41ee140a1c9e72a"
uuid = "1e83bf80-4336-4d27-bf5d-d5a4f845583c"
version = "1.4.0"

[[deps.Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[deps.TOML]]
deps = ["Dates"]
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.0"

[[deps.UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[deps.Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[deps.libblastrampoline_jll]]
deps = ["Artifacts", "Libdl", "OpenBLAS_jll"]
uuid = "8e850b90-86db-534c-a0d3-1478176c7d93"
version = "5.1.1+0"

Interesting, I can reproduce the slowness. I’m not sure what causes it, but you can make this particular example a bit faster by using tuples directly:

function g1(rng, ::Type{T}) where T
    levels = (T(.7), T(.8), T(.9), T(1.))
    nt = ntuple(i->rand(rng, levels), 4)
    SVector(nt)
end

rng = Random.default_rng()  # assumed: the reply does not show how rng was constructed
@benchmark g1($rng, $T)

julia> @benchmark g1(rng, T) # T = Float32
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  30.248 ns …  10.829 ΞΌs  β”Š GC (min … max): 0.00% … 99.47%
 Time  (median):     31.972 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   37.578 ns Β± 182.827 ns  β”Š GC (mean Β± Οƒ):  8.41% Β±  1.72%

   β–β–…β–‡β–ˆβ–‡β–„β–‚              ▁▂▄▄▅▄▄▃▂▂▁▁▁                          β–‚
  β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡β–†β–†β–†β–†β–†β–β–…β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–‡β–‡β–ˆβ–ˆβ–ˆβ–†β–†β–†β–…β–†β–…β–…β–†β–†β–…β–†β–†β–…β–†β–† β–ˆ
  30.2 ns       Histogram: log(frequency) by time      51.2 ns <

 Memory estimate: 48 bytes, allocs estimate: 1.

I get quite a significant slowdown using Float32 compared to Float64:

julia> @benchmark g1($rng, Float64)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.310 ns … 18.332 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     7.481 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   7.569 ns Β±  0.296 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ƒ    β–…β–ƒβ–ˆβ–†β–…β–ƒβ–…                  ▁ ▃▃▁          β–‚           β–‚ β–‚
  β–ˆβ–β–β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ƒβ–β–β–β–β–ƒβ–β–β–β–β–β–β–…β–†β–β–β–β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–‡β–β–β–ƒβ–β–„β–…β–†β–ˆβ–‡β–ˆβ–„β–‡β–ƒβ–ƒβ–ƒβ–ƒβ–β–‡β–†β–ˆ β–ˆ
  7.31 ns      Histogram: log(frequency) by time     8.62 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The problem is just that you’re benchmarking with T as a non-constant global variable.

julia> let T = Float32
       @btime g1($T) # <----- Note the $ to interpolate into the benchmark
       @btime g2($T)
       @btime g3($T)
       @btime g4($T)
       @btime g5($T)
       end
  72.298 ns (1 allocation: 32 bytes)
  1.100 ΞΌs (15 allocations: 624 bytes)
  2.617 ΞΌs (26 allocations: 1.55 KiB)
  378.493 ns (6 allocations: 224 bytes)
  163.829 ns (1 allocation: 80 bytes)

There’s a further problem here though, since g1 is significantly slower than it should be when called this way. Compare to:

julia> @btime g1(Float32)
  13.337 ns (1 allocation: 32 bytes)
julia> @btime g1($Float32);
  113.066 ns (1 allocation: 32 bytes)

julia> @btime g1(Float32);
  19.655 ns (1 allocation: 32 bytes)

I think not interpolating the type permits some constant propagation, which may not be what we want while benchmarking.

Guys, I found the reason myself.

First, as @Mason points out, declaring const T = Float32 improves performance because it avoids a non-constant global and the type instability it causes. This accounts for a ~90 ns delay in all variants, but cannot explain the difference among them.
Using @btime g1($T) instead of @btime g1(T) introduces a similar effect to a non-const T, again adding ~90 ns of delay. The two delays do not stack, so both are probably due to dynamic dispatch.
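
For concreteness, here is a minimal sketch of the two call styles being compared (it assumes the g1 from the original post is already defined; the const name Tc is mine):

using BenchmarkTools

const Tc = Float32     # const global: the type is known at the call site

@btime g1(Tc)          # no interpolation: the compiler can constant-propagate Tc
@btime g1($Tc)         # interpolation: Tc is passed to the benchmark kernel as a runtime value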

Second, Julia does not specialize on the type argument T with this signature, as per the performance-tips section β€œBe aware of when Julia avoids specializing”. Changing the signature from function g1(T) to function g1(::Type{T}) where T changes the game: g1 now takes ~30 ns on my machine, g2 and g3 ~60 ns, and g4 ~90 ns.
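
To make the fix concrete, here is a sketch of the specialized signature (the name g1_spec is mine; the body is unchanged from the original g1):

using BenchmarkTools
using StaticArrays

# ::Type{T} where T forces specialization on the type argument; plain g1(T) does not.
function g1_spec(::Type{T}) where T
    levels = @SVector T[.7, .8, .9, 1.]
    stacks = @MVector zeros(T, 4)
    for i in 1:4
        @inbounds stacks[i] = rand(levels)
    end
    stacks
end

@btime g1_spec(Float32)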

As an aside, I suggest returning SVector(stacks) whenever stacks is an MVector.
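
A tiny sketch of what that looks like (the variable names are mine):

using StaticArrays

m = @MVector zeros(Float32, 4)
s = SVector(m)   # immutable copy of the MVector; SVector{4, Float32} is isbits and stack-friendly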
