I’m currently rewriting some low-level code in one of my packages. The new function runs 2-3 times slower than the old one (~12 ns instead of ~5 ns), even though @code_native shows exactly the same code for both. The slowdown only appears when benchmarking a single function call; when mapping over a vector of inputs, the difference is gone. Does anybody have an idea what’s going on here?
I’m comparing the function reverse between the master branch and the shuffle128 branch of my package SmallCollections.jl. The generated native code depends on whether the processor supports AVX2 or AVX-512, so I show results for both.
Processor with AVX2 (no AVX-512):
julia> using Chairmarks, SmallCollections # master
julia> N = 32; T = Int8; v = FixedVector{N,T}(1:N); i = 3; @b reverse($v, $i)
5.807 ns
julia> M = 1000; p = [rand(FixedVector{N,T}) for _ in 1:M]; q = rand(1:N, M);
julia> @b similar(p) map!(reverse, _, $p, $q)
5.201 μs
julia> using Chairmarks, SmallCollections # shuffle128
julia> N = 32; T = Int8; v = FixedVector{N,T}(1:N); i = 3; @b reverse($v, $i)
11.702 ns
julia> M = 1000; p = [rand(FixedVector{N,T}) for _ in 1:M]; q = rand(1:N, M);
julia> @b similar(p) map!(reverse, _, $p, $q)
5.203 μs
Processor with AVX-512:
julia> using Chairmarks, SmallCollections # master
julia> N = 64; T = UInt8; v = rand(FixedVector{N,T}); i = 3; @b reverse($v, $i)
4.873 ns
julia> M = 1000; p = [rand(FixedVector{N,T}) for _ in 1:M]; q = rand(1:N, M);
julia> @b similar(p) map!(reverse, _, $p, $q)
7.882 μs
julia> using Chairmarks, SmallCollections # shuffle128
julia> N = 64; T = UInt8; v = rand(FixedVector{N,T}); i = 3; @b reverse($v, $i)
12.930 ns
julia> M = 1000; p = [rand(FixedVector{N,T}) for _ in 1:M]; q = rand(1:N, M);
julia> @b similar(p) map!(reverse, _, $p, $q)
7.884 μs
Nanosecond-scale benchmarks are tricky to get right. In general you’ll get much more trustworthy results from benchmarks that take more than 1 μs, or ideally milliseconds.
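A minimal sketch of that advice, using only Base: instead of timing one call, time many calls over a batch of inputs and divide by the number of calls, so that timer overhead and per-sample noise are amortized. The names kernel and per_call_ns are hypothetical stand-ins for illustration and are not part of SmallCollections or any benchmarking package.

```julia
# Hypothetical stand-in for the function under test (not from SmallCollections).
kernel(x::UInt64, i::Int) = bitrotate(x, i) ⊻ x

# Apply `f` elementwise over the inputs `reps` times and return the
# mean wall-clock time per call in nanoseconds.
function per_call_ns(f, xs, is; reps = 1_000)
    out = similar(xs)
    map!(f, out, xs, is)            # warm-up so compilation is not timed
    t = @elapsed for _ in 1:reps
        map!(f, out, xs, is)
    end
    return 1e9 * t / (reps * length(xs))
end

M = 1000
xs = rand(UInt64, M)                # batch of inputs
is = rand(1:63, M)                  # batch of second arguments
println(per_call_ns(kernel, xs, is), " ns per call")
```

This measures throughput over a batch rather than the latency of an isolated call, which is the same distinction as between the map! timings and the single-call timings in the question.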
I can’t reproduce this, but that doesn’t mean it isn’t real.
(@v1.12) pkg> activate --temp
Activating new project at `/tmp/jl_nu4uCK`
julia> using SmallCollections, Chairmarks, BenchmarkTools
│ Package SmallCollections not found, but a package named SmallCollections is
│ available from a registry.
│ Install package?
│ (jl_nu4uCK) pkg> add SmallCollections
└ (y/n/o) [y]:
Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
Installed SmallCollections ─ v0.6.0
Updating `/tmp/jl_nu4uCK/Project.toml`
[2b935e18] + SmallCollections v0.6.0
Updating `/tmp/jl_nu4uCK/Manifest.toml`
[c3b6d118] + BitIntegers v0.3.7
[adafc99b] + CpuId v0.3.1
[2b935e18] + SmallCollections v0.6.0
[56f22d72] + Artifacts v1.11.0
[2a0f44e3] + Base64 v1.11.0
[ac6e5ff7] + JuliaSyntaxHighlighting v1.12.0
[8f399da3] + Libdl v1.11.0
[37e2e46d] + LinearAlgebra v1.12.0
[d6f4376e] + Markdown v1.11.0
[9a3f8284] + Random v1.11.0
[ea8e919c] + SHA v0.7.0
[f489334b] + StyledStrings v1.11.0
[e66e0078] + CompilerSupportLibraries_jll v1.3.0+1
[4536629a] + OpenBLAS_jll v0.3.29+0
[8e850b90] + libblastrampoline_jll v5.15.0+0
Precompiling SmallCollections finished.
3 dependencies successfully precompiled in 2 seconds. 4 already precompiled.
julia> N = 32; T = Int8; v = FixedVector{N,T}(1:N); i = 3;
julia> @b reverse($v, $i) # Chairmarks
13.697 ns
julia> @btime reverse($v, $i); # BenchmarkTools
13.639 ns (0 allocations: 0 bytes)
(jl_nu4uCK) pkg> st -m
Status `/tmp/jl_nu4uCK/Manifest.toml`
[c3b6d118] BitIntegers v0.3.7
[adafc99b] CpuId v0.3.1
[2b935e18] SmallCollections v0.6.0
[56f22d72] Artifacts v1.11.0
[2a0f44e3] Base64 v1.11.0
[ac6e5ff7] JuliaSyntaxHighlighting v1.12.0
[8f399da3] Libdl v1.11.0
[37e2e46d] LinearAlgebra v1.12.0
[d6f4376e] Markdown v1.11.0
[9a3f8284] Random v1.11.0
[ea8e919c] SHA v0.7.0
[f489334b] StyledStrings v1.11.0
[e66e0078] CompilerSupportLibraries_jll v1.3.0+1
[4536629a] OpenBLAS_jll v0.3.29+0
[8e850b90] libblastrampoline_jll v5.15.0+0
julia> versioninfo()
Julia Version 1.12.5
Commit 5fe89b8ddc1 (2026-02-09 16:05 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (aarch64-linux-gnu)
CPU: 8 × unknown
WORD_SIZE: 64
LLVM: libLLVM-18.1.7 (ORCJIT, apple-m2)
GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)
Environment:
JULIA_EDITOR = code
If you can produce an MWE of this "identical code but different performance" issue that doesn’t require unregistered package versions (ideally just benchmarking packages and Base), I’d be happy to take a closer look.