This code doesn’t do anything useful; it is just intended as a MWE (minimal working example). my_func! computes the sum s in ≈ 120 μs on my computer. Strangely, when I uncomment the lines adding to the array useless, it seems to do the same computation in under 50 μs. How could doing more work lead to better performance? I couldn’t find any type instability that would explain this. If this speedup is somehow real, I would like to know how to implement it in a less clunky way.
using BenchmarkTools
x = rand(100)
useless = zeros(100)
function my_func!(useless, x)
    useless .= 0.0
    s = 0.0
    for i in eachindex(x)
        s_in = 0.0
        for j in eachindex(x)
            s_in += log(√(x[j] + i * j))
        end
        s += s_in
        #=
        for j in eachindex(useless)
            useless[j] += s_in
        end
        =#
    end
    return s
end
function my_useless_func!(useless, x)
    useless .= 0.0
    s = 0.0
    for i in eachindex(x)
        s_in = 0.0
        for j in eachindex(x)
            s_in += log(√(x[j] + i * j))
        end
        s += s_in
        for j in eachindex(useless)
            useless[j] += s_in
        end
    end
    return s
end
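As an aside on the "less clunky" part: if the side effect on useless is actually wanted, the inner accumulation loop can be written as a single broadcast. A minimal sketch (my_useless_func2! is just an illustrative name, not from the thread):

```julia
# Same computation as my_useless_func!, with the inner j-loop replaced
# by a broadcast: `useless .+= s_in` adds the scalar to every element.
function my_useless_func2!(useless, x)
    useless .= 0.0
    s = 0.0
    for i in eachindex(x)
        s_in = 0.0
        for j in eachindex(x)
            s_in += log(√(x[j] + i * j))
        end
        s += s_in
        useless .+= s_in
    end
    return s
end
```

Since every element of useless accumulates the same s_in for each i, each element ends up equal to the returned sum s.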
I have tried many things to explain this difference, like adding @inbounds and @simd, with no big difference in the results. One very strange finding: despite the fact that eachindex(x) is Base.OneTo(100), when I replace the line
for i in eachindex(x)
with
for i in Base.OneTo(100)
in both functions, my_func! still takes ≈ 120 μs, but my_useless_func! now takes longer (≈ 129 μs), as I would normally expect.
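For reference, eachindex on a Vector really does return Base.OneTo(length(x)) — the same value and the same type — so the substitution should be semantically a no-op. A quick check:

```julia
x = rand(100)
# Base.OneTo is an immutable struct, so === compares by value:
# eachindex(x) and Base.OneTo(100) are indistinguishable objects.
eachindex(x) === Base.OneTo(100)  # true
```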
This might be platform dependent. @code_llvm shows significantly more code for my_useless_func!(useless, x) than for my_func!(useless, x), as one would expect. On my machine the two timing distributions overlap much more, though my_useless_func! is marginally ahead in the @benchmark stats. It’s always worth checking @benchmark when a @btime result looks odd.
julia> @benchmark my_useless_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 230.600 μs … 1.889 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 230.800 μs ┊ GC (median): 0.00%
Time (mean ± σ): 256.055 μs ± 67.980 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂▂ ▁ ▃ ▂ ▂ ▁
█████▇▇█▇▇▆▇▆▆█▆▆██▆▇▆▅█▆▅▅▅▅▅▆▆▄▄▅▃█▇▅▅▆▅▃▃▄▄█▆█▅▄▅█▆▆▅▄▅█▇ █
231 μs Histogram: log(frequency) by time 440 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark my_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 241.700 μs … 1.444 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 242.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 257.585 μs ± 50.336 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂▄ ▁ ▃ ▂ ▁ ▁
██████▇█▇█▇▇▇▆▆▅▇▅▅█▆▇▇▅▅▅▆▆▅▄▆▆▃▄▅▇▅▅▁▄▇▅▄▃▄▇▆▅▄▁▄██▇▅▄▄▄█▇ █
242 μs Histogram: log(frequency) by time 439 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
EDIT: Adding my versioninfo as a datapoint for anyone digging into the platform dependence here:
julia> versioninfo()
Julia Version 1.11.7
Commit f2b3dbda30 (2025-09-08 12:10 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, icelake-client)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
On my machine (a Linux server), the Base.OneTo(100) version is faster.
julia> function my_func!(useless, x)
           useless .= 0.0
           s = 0.0
           for i in eachindex(x)
               s_in = 0.0
               for j in eachindex(x)
                   s_in += log(√(x[j] + i * j))
               end
               s += s_in
           end
           s
       end
my_func! (generic function with 1 method)
julia> @btime my_func!($useless, $x);
86.594 μs (0 allocations: 0 bytes)
julia> function my_func!(useless, x)
           useless .= 0.0
           s = 0.0
           for i in Base.OneTo(100)
               s_in = 0.0
               for j in Base.OneTo(100)
                   s_in += log(√(x[j] + i * j))
               end
               s += s_in
           end
           s
       end
my_func! (generic function with 1 method)
julia> @btime my_func!($useless, $x);
81.984 μs (0 allocations: 0 bytes)
I also can’t reproduce; for me, they have essentially the same timings. The extra work that my_useless_func! does over the other function is an insignificant part of the total work in the function, because it compiles to vectorized additions and useless is already in the L1 cache.
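To put a rough number on that, the useless update loop can be timed in isolation (a hedged sketch; add_scalar! is an illustrative helper, not from the thread). On a 100-element array that is hot in cache it costs tens of nanoseconds at most, versus roughly 100 μs for the 10,000 log(√(...)) evaluations:

```julia
using BenchmarkTools

# Illustrative helper: just the inner `useless` update loop from
# my_useless_func!, on its own.
function add_scalar!(useless, s_in)
    for j in eachindex(useless)
        useless[j] += s_in
    end
    return useless
end

useless = zeros(100)
# Typically a few tens of ns on recent hardware — negligible next to
# the cost of the log/√ work in the outer function.
@btime add_scalar!($useless, 1.0)
```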
Thanks everyone for your help. Seeing that most of you can’t reproduce this behavior, I thought of another experiment: installing Julia directly on Windows instead of WSL (on the same machine). The code runs a bit more slowly, and there seems to be a slight improvement from my_useless_func!, but nowhere near as dramatic as before.
julia> @benchmark my_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 134.300 μs … 784.400 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 145.200 μs ┊ GC (median): 0.00%
Time (mean ± σ): 147.662 μs ± 14.218 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁▂ █▅ ▁█ ▁▁▇ ▁▁█ ▁▂▆ ▁▇ ▁▆ ▁▅ ▅ ▁ ▁ ▃
█▅███▇██▇██▇███████▇███▇███▇████▇▇██▆▆▇▇█▅▇▇▆█▄▂▆▅█▄▆▆▆▅▇▄▃▄▄ █
134 μs Histogram: log(frequency) by time 181 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark my_useless_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min … max): 125.300 μs … 452.600 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 135.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 139.672 μs ± 22.715 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▃▄▄
▆▃█████▄▅▄▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▁▂▂▁▂ ▃
125 μs Histogram: frequency by time 269 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.11.7
Commit f2b3dbda30 (2025-09-08 12:10 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900HX
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
Have you compared the @code_llvm of the eachindex(x) and Base.OneTo(100) versions of my_useless_func!, where there’s the 2.6× difference? If anything is on the compiler’s end, it’d be there. Otherwise, I’d chalk it up to stumbling into an obscure condition that made your processor happier.
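A hedged sketch of that comparison (the two functions are copied from the thread so this runs standalone; in a session where they are already defined, only the last two lines are needed). The debuginfo=:none option strips line-number annotations so the loop structure is easier to eyeball:

```julia
using InteractiveUtils  # provides @code_llvm (auto-loaded in the REPL)

# Copies of the two functions under discussion, so this block is standalone.
function my_func!(useless, x)
    useless .= 0.0
    s = 0.0
    for i in eachindex(x)
        s_in = 0.0
        for j in eachindex(x)
            s_in += log(√(x[j] + i * j))
        end
        s += s_in
    end
    return s
end

function my_useless_func!(useless, x)
    useless .= 0.0
    s = 0.0
    for i in eachindex(x)
        s_in = 0.0
        for j in eachindex(x)
            s_in += log(√(x[j] + i * j))
        end
        s += s_in
        for j in eachindex(useless)
            useless[j] += s_in
        end
    end
    return s
end

x = rand(100); useless = zeros(100)
# Dump the optimized IR for both versions; compare the number of loop
# headers (blocks targeted by back-edges) between the two dumps.
@code_llvm debuginfo=:none my_func!(useless, x)
@code_llvm debuginfo=:none my_useless_func!(useless, x)
```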
It is also possible that two logical loops share a header, but are considered a single loop by LLVM:
for (int i = 0; i < 128; ++i)
    for (int j = 0; j < 128; ++j)
        body(i, j);
which might be represented in LLVM IR as a single loop: the IR has only one header block, so LLVM’s loop analysis treats the nest as one loop.
So maybe in the clean version the compiler turns for i in eachindex(x); for j in eachindex(x) into a single loop, while in the useless version it keeps them as two loops?
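As a hedged illustration of what such a fusion means, here is the nest from my_func! hand-rewritten as a single loop over one combined induction variable (fused_sum is an illustrative name, not from the thread):

```julia
# Single-loop version of the double loop in my_func!: one counter k
# enumerates all (i, j) pairs, recovering i and j by integer division
# and remainder. This is roughly what a compiler that collapses the
# nest into a single loop would execute.
function fused_sum(x)
    n = length(x)
    s = 0.0
    for k in 0:(n * n - 1)
        i = k ÷ n + 1   # outer index, 1-based
        j = k % n + 1   # inner index, 1-based
        s += log(√(x[j] + i * j))
    end
    return s
end
```

It computes the same sum as the nested version (up to floating-point reassociation), just with different loop bookkeeping.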