Functions with identical native code but different benchmarks

I’m currently rewriting some low-level code in one of my packages. I’m seeing that the new function runs 2-3 times slower than the old one (~12 ns instead of ~5 ns), although @code_native shows exactly the same code. The slowdown only appears when benchmarking a single function call; when mapping over a vector of inputs, the difference is gone. Does anybody have an idea what’s going on here?

I’m comparing the function reverse between the branch v0.6.0 (formerly master; also in the Julia registry) and the branch discourse-136336 (formerly shuffle128) of my package SmallCollections.jl. The native code differs between processors with AVX2 and those with AVX-512.

Processor with AVX2 (no AVX-512):

julia> using Chairmarks, SmallCollections  # v0.6.0

julia> N = 32; T = Int8; v = FixedVector{N,T}(1:N); i = 3; @b reverse($v, $i)
5.807 ns

julia> M = 1000; p = [rand(FixedVector{N,T}) for _ in 1:M]; q = rand(1:N, M);

julia> @b similar(p) map!(reverse, _, $p, $q)
5.201 μs
julia> using Chairmarks, SmallCollections  # discourse-136336

julia> N = 32; T = Int8; v = FixedVector{N,T}(1:N); i = 3; @b reverse($v, $i)
11.702 ns

julia> M = 1000; p = [rand(FixedVector{N,T}) for _ in 1:M]; q = rand(1:N, M);

julia> @b similar(p) map!(reverse, _, $p, $q)
5.203 μs

Processor with AVX-512:

julia> using Chairmarks, SmallCollections  # v0.6.0

julia> N = 64; T = UInt8; v = rand(FixedVector{N,T}); i = 3; @b reverse($v, $i)
4.873 ns

julia> M = 1000; p = [rand(FixedVector{N,T}) for _ in 1:M]; q = rand(1:N, M);

julia> @b similar(p) map!(reverse, _, $p, $q)
7.882 μs
julia> using Chairmarks, SmallCollections  # discourse-136336

julia> N = 64; T = UInt8; v = rand(FixedVector{N,T}); i = 3; @b reverse($v, $i)
12.930 ns

julia> M = 1000; p = [rand(FixedVector{N,T}) for _ in 1:M]; q = rand(1:N, M);

julia> @b similar(p) map!(reverse, _, $p, $q)
7.884 μs

EDIT: All functions are defined with @inline.

@Lilith Could this be an issue with Chairmarks.jl?

julia> using SmallCollections, Chairmarks, BenchmarkTools

julia> N = 32; T = Int8; v = FixedVector{N,T}(1:N); i = 3;

julia> @b reverse($v, $i)  # Chairmarks
13.206 ns

julia> @btime reverse($v, $i);  # BenchmarkTools
  5.167 ns (0 allocations: 0 bytes)

Nanosecond-scale functions are tricky to benchmark. In general you’ll get much more trustworthy results from benchmarks that take >1 μs, or ideally milliseconds.
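
One way to follow this advice (a rough sketch using only Base; `per_call_seconds` is a hypothetical helper, and the map! benchmarks earlier in the thread do the same thing properly with Chairmarks) is to time a whole batch of inputs so each measurement lasts microseconds or more:

```julia
# Sketch: amortize clock resolution and per-measurement overhead by
# timing a batch of calls and dividing by the batch size.
function per_call_seconds(f, xs)
    t = @elapsed for x in xs
        f(x)               # result discarded; fine for a rough sketch
    end
    return t / length(xs)  # rough average seconds per call
end

per_call_seconds(abs, rand(Int, 1_000))
```

Note that a real benchmarking tool also has to keep the compiler from eliminating the unused call, which this sketch does not attempt.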

I can’t reproduce this, but that doesn’t mean it isn’t real.

(@v1.12) pkg> activate --temp
  Activating new project at `/tmp/jl_nu4uCK`

julia> using SmallCollections, Chairmarks, BenchmarkTools
 │ Package SmallCollections not found, but a package named SmallCollections is
 │ available from a registry. 
 │ Install package?
 │   (jl_nu4uCK) pkg> add SmallCollections 
 └ (y/n/o) [y]: 
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed SmallCollections ─ v0.6.0
    Updating `/tmp/jl_nu4uCK/Project.toml`
  [2b935e18] + SmallCollections v0.6.0
    Updating `/tmp/jl_nu4uCK/Manifest.toml`
  [c3b6d118] + BitIntegers v0.3.7
  [adafc99b] + CpuId v0.3.1
  [2b935e18] + SmallCollections v0.6.0
  [56f22d72] + Artifacts v1.11.0
  [2a0f44e3] + Base64 v1.11.0
  [ac6e5ff7] + JuliaSyntaxHighlighting v1.12.0
  [8f399da3] + Libdl v1.11.0
  [37e2e46d] + LinearAlgebra v1.12.0
  [d6f4376e] + Markdown v1.11.0
  [9a3f8284] + Random v1.11.0
  [ea8e919c] + SHA v0.7.0
  [f489334b] + StyledStrings v1.11.0
  [e66e0078] + CompilerSupportLibraries_jll v1.3.0+1
  [4536629a] + OpenBLAS_jll v0.3.29+0
  [8e850b90] + libblastrampoline_jll v5.15.0+0
Precompiling SmallCollections finished.
  3 dependencies successfully precompiled in 2 seconds. 4 already precompiled.

julia> N = 32; T = Int8; v = FixedVector{N,T}(1:N); i = 3;

julia> @b reverse($v, $i)  # Chairmarks
13.697 ns

julia> @btime reverse($v, $i);  # BenchmarkTools
  13.639 ns (0 allocations: 0 bytes)

(jl_nu4uCK) pkg> st -m
Status `/tmp/jl_nu4uCK/Manifest.toml`
  [c3b6d118] BitIntegers v0.3.7
  [adafc99b] CpuId v0.3.1
  [2b935e18] SmallCollections v0.6.0
  [56f22d72] Artifacts v1.11.0
  [2a0f44e3] Base64 v1.11.0
  [ac6e5ff7] JuliaSyntaxHighlighting v1.12.0
  [8f399da3] Libdl v1.11.0
  [37e2e46d] LinearAlgebra v1.12.0
  [d6f4376e] Markdown v1.11.0
  [9a3f8284] Random v1.11.0
  [ea8e919c] SHA v0.7.0
  [f489334b] StyledStrings v1.11.0
  [e66e0078] CompilerSupportLibraries_jll v1.3.0+1
  [4536629a] OpenBLAS_jll v0.3.29+0
  [8e850b90] libblastrampoline_jll v5.15.0+0

julia> versioninfo()
Julia Version 1.12.5
Commit 5fe89b8ddc1 (2026-02-09 16:05 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 8 × unknown
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m2)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_EDITOR = code

If you can produce a MWE for the identical code but different performance issue that doesn’t require the use of unregistered package versions (ideally just benchmarking packages and Base) I’d be happy to take a closer look.


OK. Once the new version is released, I’ll let you know.

Did you try calling reverse twice, to eliminate first-time call overhead?

julia> using Chairmarks, SmallCollections  
julia> N = 32; T = Int8; v = FixedVector{N,T}(1:N); i = 3; 
julia> @b reverse($v, $i)
julia> @b reverse($v, $i)

My second call results in ~5 ns, as expected. Inspecting the runtime behavior with CodeGlass also confirms my finding that the shuffle128 version requires more warmup.

I’ve tried this on both machines, but it doesn’t change anything. The benchmarks with @b change only minimally.

Note: I’ve renamed the branches so that I can modify the branches I was using before. I’ve updated the “Details” section in my first post accordingly.

Interesting, I will investigate further tomorrow and validate my results another time :slight_smile:


The benchmark macro (unlike @time) already runs multiple times and reports the fastest sample, so it excludes outliers like first-call compilation overhead.


@stevengj Thank you! Warmup seemed the logical answer based on my findings, but you are right.
Though are you sure that @b returns the fastest sample when the duration is very short, rather than more of an average?

I spent some time digging into this, as things did not add up. There is an issue in Chairmarks.jl, but I am unsure whether it is related to your question. I will create a ticket on GitHub for it.

The issue seems to be that reverse can run faster than the smallest amount of time that can be measured accurately (1 tick), which on my machine is 0.1 μs. (If I am wrong about this, please let me know; I have never found another reliably accurate way.)
Below that, you need to execute the function multiple times and take the average. I assume this is also what other benchmark tools do, which is probably why @Lilith said nano-benchmarks are tricky (?).
And when taking such an average, it is always good to take the longest single-call duration into consideration.
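
The averaging idea described above can be sketched in a few lines (a toy model, not the actual implementation of CodeGlass or Chairmarks; `batched_min_time` is a hypothetical helper):

```julia
# Sketch: when one call is shorter than the clock resolution, time a
# batch of `evals` calls and divide, then repeat the measurement and
# keep the fastest sample.
function batched_min_time(f, x; evals = 1_000, samples = 100)
    best = Inf
    for _ in 1:samples
        t0 = time_ns()
        for _ in 1:evals
            f(x)   # NB: a real tool must also prevent the compiler
        end        # from eliminating this call as dead code
        best = min(best, (time_ns() - t0) / evals)
    end
    return best    # fastest average, in ns per call
end

batched_min_time(abs, -3)
```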

Below are the measurements I gathered. Each time, reverse was called 400_001 times, and the shortest single-call duration was always 0 t.

Statistics from CodeGlass

branch      Total duration   Longest single-call duration
discourse   97_220 t         156 t
            96_188 t         70 t
            136_434 t        33_477 t
            98_029 t         375 t
            162_303 t        63_295 t
            96_864 t         58 t
            98_790 t         418 t
            113_411 t        8_528 t
            116_236 t        18_172 t
            97_050 t         139 t
v060        97_119 t         145 t
            97_258 t         411 t
            97_985 t         235 t
            106_862 t        8_725 t
            108_764 t        8_466 t
            97_731 t         117 t
            97_903 t         478 t
            103_561 t        6_862 t
            98_651 t         417 t
            116_864 t        19_623 t

Edit: reverse is indeed fully inlined. :slight_smile:

Yes. The documentation for @b says:

Benchmark f and return the fastest Sample.

(That being said, each Sample for a very short-duration call is an average time over several function evaluations, the evals field, in order to get adequate timing resolution. But this sampling process is repeated multiple times before returning the minimum, which eliminates outliers like one-time startup costs.)

And you can easily verify from the source code that @b uses the summarize function to fetch the output, which indeed returns the minimum.

The justification for using the minimum sampled time is described in Chen and Revels (2016): essentially, the noise in benchmarking is always positive — at least for deterministic functions — since system interruptions never make things go faster, so the minimum is typically the most robust estimator of execution time. (They were not the first authors to use this methodology — e.g., it was used by the lmbench benchmark (McVoy, 1996), and we imitated McVoy in our FFT benchmarks for FFTW — but previous authors hadn’t attempted a statistical justification to my knowledge.)
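The additive-positive-noise argument above can be illustrated with a toy simulation (made-up numbers, not real timings):

```julia
# Toy illustration: benchmarking noise is additive and positive, so the
# minimum of many samples estimates the true cost far better than the mean.
using Random, Statistics
Random.seed!(42)
true_time = 5.0                      # "true" cost of the call, in ns
noise = 100 .* rand(10_000) .^ 8     # rare, large positive spikes
samples = true_time .+ noise
minimum(samples)   # ≈ 5.0: spikes never push a sample below the truth
mean(samples)      # well above 5.0: the mean absorbs every spike
```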


@stevengj Thank you, this confirms what I assumed and found in the Docs and source code!

I only failed at describing it correctly. In case you’d like to read my “funny” mistake:

Funny misunderstanding....

I made a wrong assumption here:

I assumed with ‘sample’ you meant a single recorded function call duration.

This made me go through the source and docs, as I was very curious how they managed that for such a short-running function, but I could only find that they take an average when the duration is this short. At which point I asked you:

At which point I should have noticed the ambiguity of using ‘sample’ and that you meant Sample by it.

I am thankful that this caused you to share the underlying documentation about this, thank you! :slight_smile:

@matthias314 I still don’t have a clear answer for you. However, the statistics I provided in my previous post show that there is no actual time difference between the branches v060 and discourse-136336 when looking at the total duration over multiple calls.

I did some more runs, and it does seem that the discourse-136336 version suffers more system interruptions (more frequent, and with a higher longest single-call duration), which might explain it, but I am not yet confident in this.


@Tyrone Thanks for your help!!

In the meantime I’ve found that a tiny change in the code (switching from UInt8 to Int8 somewhere) makes the whole issue miraculously go away. Understanding the differences in the benchmarks would still be interesting, to avoid similar problems in the future. However, it’s no longer an issue for the next release of my package.

A minimal reproducer would help.

I suppose a possible cause is differing alignment of the executable code. Even when @code_native is identical, the performance characteristics of the machine code may vary due to different offsets, perhaps because padding (NOP instructions?) is inserted at different places. I don’t remember the specifics, but look into profile-guided optimization (PGO) showcases for proof that this matters: in some experiments, such things can cause more than 10% performance differences, as far as I remember.

That’s probably tricky, because the benchmarks here seem to be sensitive even to small changes. One could say that the current example is already minimal in the sense that the function we are talking about has fewer than 50 assembly instructions (and does not call any other function).

I meant a reproducer that does not depend on any package, except Chairmarks.jl / BenchmarkTools.jl.