Strange speed gain from eachindex and writing in a useless array?

This code doesn’t do anything useful; it is just intended as an MWE. my_func! outputs the sum s in ≈ 120 μs on my computer. Strangely, when I uncomment the lines adding to the array useless, it seems to do the same computation in under 50 μs. How could doing more work lead to better performance? I couldn’t find any type instability that would explain this. If this speedup is somehow real, I would like to know how to implement it in a less clunky way.

using BenchmarkTools

x = rand(100)
useless = zeros(100)

function my_func!(useless, x)
    useless .= 0.
    s = 0.
    for i in eachindex(x)
        s_in = 0.
        for j in eachindex(x)
            s_in += log(√(x[j] + i * j))
        end
        s += s_in
        #=
        for j in eachindex(useless)
            useless[j] += s_in
        end
        =#
    end
    return s 
end

function my_useless_func!(useless, x)
    useless .= 0.
    s = 0.
    for i in eachindex(x)
        s_in = 0.
        for j in eachindex(x)
            s_in += log(√(x[j] + i * j))
        end
        s += s_in
        for j in eachindex(useless)
            useless[j] += s_in
        end
    end
    return s 
end

julia> @btime my_func!($useless, $x)
  120.532 μs (0 allocations: 0 bytes)
36381.83512310656

julia> @btime my_useless_func!($useless, $x)
  48.597 μs (0 allocations: 0 bytes)
36381.83512310656

I have tried many things to explain this difference, such as adding @inbounds and @simd, but they made no big difference to the results. One very strange finding is that, even though eachindex(x) is Base.OneTo(100), when I replace the line

for i in eachindex(x)

with

for i in Base.OneTo(100)

in both functions, my_func! still takes ≈ 120μs, but my_useless_func! now takes longer (≈129μs) as I would normally expect.
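
One way to compare the two loop headers side by side, without hand-editing the function between runs, is to generate both variants from the same body. This is only a sketch: the names useless_eachindex! and useless_oneto! are hypothetical, and it assumes that only the outer i loop is swapped, as the sentence above literally describes.

```julia
# Generate two copies of my_useless_func! that differ only in the outer loop range.
# The function names are hypothetical, chosen for this sketch.
for (name, outer) in ((:useless_eachindex!, :(eachindex(x))),
                      (:useless_oneto!,     :(Base.OneTo(100))))
    @eval function $name(useless, x)
        useless .= 0.0
        s = 0.0
        for i in $outer
            s_in = 0.0
            for j in eachindex(x)
                s_in += log(√(x[j] + i * j))
            end
            s += s_in
            for j in eachindex(useless)
                useless[j] += s_in
            end
        end
        return s
    end
end

x = rand(100)
useless = zeros(100)

# Both variants should agree before their timings are worth comparing.
s_each = useless_eachindex!(useless, x)
s_oneto = useless_oneto!(useless, x)
println(s_each ≈ s_oneto)
```

Each variant can then be timed in the same session with @btime useless_eachindex!($useless, $x) and @btime useless_oneto!($useless, $x), which keeps the two definitions from drifting apart.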


Can you show your versioninfo()? I’m not able to reproduce this.

Might be platform dependent. @code_llvm shows significantly more code for my_useless_func!(useless, x) than for my_func!(useless, x), as one would expect. I’m seeing much more overlapping distributions, though my_useless_func! is marginally ahead in the @benchmark stats. Always check the full @benchmark output if a @btime result looks odd.

julia> @benchmark my_useless_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  230.600 μs …  1.889 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     230.800 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   256.055 μs ± 67.980 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▂ ▁            ▃                            ▂     ▂        ▁
  █████▇▇█▇▇▆▇▆▆█▆▆██▆▇▆▅█▆▅▅▅▅▅▆▆▄▄▅▃█▇▅▅▆▅▃▃▄▄█▆█▅▄▅█▆▆▅▄▅█▇ █
  231 μs        Histogram: log(frequency) by time       440 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark my_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  241.700 μs …  1.444 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     242.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   257.585 μs ± 50.336 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▄ ▁              ▃                               ▂      ▁  ▁
  ██████▇█▇█▇▇▇▆▆▅▇▅▅█▆▇▇▅▅▅▆▆▅▄▆▆▃▄▅▇▅▅▁▄▇▅▄▃▄▇▆▅▄▁▄██▇▅▄▄▄█▇ █
  242 μs        Histogram: log(frequency) by time       439 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
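
For comparing two trials numerically rather than by eye, a minimal sketch along these lines may help. It assumes BenchmarkTools is installed; work_a and work_b are hypothetical stand-ins for my_func! and my_useless_func!, used only to keep the example short.

```julia
using BenchmarkTools
using Statistics: median, mean

# Hypothetical stand-in workloads; substitute my_func! and my_useless_func!.
work_a(x) = sum(v -> log(√(v + 1)), x)
work_b(x) = sum(v -> log(√(v + 2)), x)

x = rand(100)

# seconds=1 shortens the default time budget of each benchmark.
trial_a = @benchmark work_a($x) seconds=1
trial_b = @benchmark work_b($x) seconds=1

# @btime prints only the minimum; the median and mean help show whether
# the two distributions actually overlap.
for (label, t) in (("work_a", trial_a), ("work_b", trial_b))
    println(label, ": min = ", minimum(t), ", median = ", median(t), ", mean = ", mean(t))
end

# judge classifies the difference between two estimates as an improvement,
# a regression, or invariant, relative to a tolerance.
println(judge(median(trial_a), median(trial_b)))
```

Comparing medians instead of minima is a design choice here: the minimum is the least noisy estimate, but a large gap that appears only in the minimum is a hint that the effect may be fragile.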

EDIT: Adding my versioninfo as a datapoint for anyone digging into the platform dependence here:

julia> versioninfo()
Julia Version 1.11.7
Commit f2b3dbda30 (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, icelake-client)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

I’m on Windows Subsystem for Linux

julia> versioninfo()
Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900HX
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

With @benchmark

julia> @benchmark my_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  120.540 μs … 627.523 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     125.261 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   139.701 μs ±  41.001 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▃▄▂▁▁▁▁ ▁ ▁ ▁                                ▂   ▂          ▁
  █████████▆███▇███▇█▆█▇██▇▇▆▆█▅█▅█▅▅█▅▃█▅▇▆▄▅█▅▅█▅▆▇█▇▆▇▆▅▄▄▆▅ █
  121 μs        Histogram: log(frequency) by time        288 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark my_useless_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  48.607 μs … 950.677 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     48.743 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.008 μs ±  16.512 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▄▆▃▂▂▂▁▂▁▂ ▁▁ ▁                                          ▁ ▂
  ████████████▇██▆█▆██▆█▅▇▅▆▄▅▅▅▅▃▅▅▃▆▁▄▇▃▄▆▄▄▇▃▃▅▆▃▃▇▄▃▄█▄▄▄█ █
  48.6 μs       Histogram: log(frequency) by time       105 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

On my machine (a Linux server), the Base.OneTo(100) version is slightly faster.

julia> function my_func!(useless, x)
           useless .= 0.
           s = 0.
           for i in eachindex(x)
               s_in = 0.
               for j in eachindex(x)
                   s_in += log(√(x[j] + i * j))
               end
               s += s_in
           end
           s 
       end
my_func! (generic function with 1 method)

julia> @btime my_func!($useless, $x);
  86.594 μs (0 allocations: 0 bytes)

julia> function my_func!(useless, x)
           useless .= 0.
           s = 0.
           for i in Base.OneTo(100)
               s_in = 0.
               for j in Base.OneTo(100)
                   s_in += log(√(x[j] + i * j))
               end
               s += s_in
           end
           s 
       end
my_func! (generic function with 1 method)

julia> @btime my_func!($useless, $x);
  81.984 μs (0 allocations: 0 bytes)

My runs are much closer.

julia> @benchmark my_useless_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  40.375 μs … 56.542 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     40.625 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.885 μs ±  1.423 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▅    ▂▂                                                   ▁
  ███▇▆▇███▇▆▃▆▆▅▆▆▆▆▄▆▆▅▅▅▅▆▅▆▇▆▆█▆▄▄▃▅▅▄▅▃▄▄▄▄▄▃▄▃▁▃▄▅▃▃▄▄▄ █
  40.4 μs      Histogram: log(frequency) by time      49.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark my_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  37.875 μs … 56.125 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.084 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.380 μs ±  1.367 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▆▄   ▃▄▂                                                  ▂
  █████▆▆███▇▅▃▆▅▅▅▅▇▆▆▅▇▅▄▆▅▅▅▅▅▅▄▆▆▇▇▇▅▆▅▄▄▄▁▄▃▅▄▄▅▅▃▃▄▄▃▃▄ █
  37.9 μs      Histogram: log(frequency) by time      45.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

I also can’t reproduce; for me, they have essentially the same timings. The extra work that my_useless_func! does over the other function is an insignificant part of the total work in the function, because it compiles to vectorized additions, and useless is already in L1 cache.
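
A rough operation count makes this argument concrete. The sketch below only does the arithmetic for the MWE's sizes; the 32 KiB L1 data cache mentioned in the comment is an assumption that varies by CPU.

```julia
x = rand(100)
useless = zeros(100)

n = length(x)

# Each outer iteration evaluates log(√(...)) n times and then performs
# length(useless) additions, so both counts scale as n^2.
log_sqrt_calls = n * n
extra_adds = n * length(useless)

# 100 Float64 values occupy 800 bytes, far below a typical 32 KiB L1 data cache
# (the cache size is an assumption; it varies by CPU).
bytes = sizeof(useless)

println("log/sqrt evaluations: ", log_sqrt_calls)
println("extra additions:      ", extra_adds)
println("useless size (bytes): ", bytes)
```

The counts are equal (10,000 each), so the claim rests on the per-operation cost: a log plus a square root is generally far more expensive than one floating-point addition, and the additions can be vectorized.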


I actually can reproduce on Linux, but not on macOS. Make an issue?


It might be CPU-related. I can’t reproduce on any version of Julia:

julia> versioninfo()
Julia Version 1.12.0-rc2
Commit 72cbf019d04 (2025-09-06 12:00 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 2700X Eight-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver1)
  GC: Built with stock GC
Threads: 16 default, 1 interactive, 16 GC (on 16 virtual cores)

I thought it might have something to do with virtualization, but my Ubuntu virtual machine results are faster than native macOS:

julia> @btime my_useless_func!($useless, $x)
  32.626 μs (0 allocations: 0 bytes)
36380.97379274098

julia> @benchmark my_useless_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  32.668 μs … 77.753 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     33.334 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   33.554 μs ±  1.155 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▅█▆        ▁                                 ▁▁       ▁
  ▄▁▅▆▆▆███▁▁▅▄▅▅▁▇█▄▄▄▁▁▁▃▅▁▁▃▁▁▃▁▅▅▅▅▅▅▅▅▅▄▅▄▃▄▅▃▅████▇▇▇▆▅ █
  32.7 μs      Histogram: log(frequency) by time      37.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark my_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  30.209 μs … 117.796 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.627 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.865 μs ±   1.396 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▆█▅       ▁                                      ▁▁      ▁
  ▃▄▁▃███▃▁▁▄▄▁▅██▄▁▄▄▃▁▁▃▄▃▁▃▄▄▁▄▄▃▄▆▅▅▅▆▄▅▅▅▃▄▄▅▄▄▃▄▇████▆▆▆ █
  30.2 μs       Histogram: log(frequency) by time      34.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thanks everyone for your help. Seeing that most of you can’t reproduce this behavior, I thought of another experiment: installing Julia directly on Windows instead of WSL (on the same machine). The code runs a bit more slowly, and there seems to be a slight improvement from my_useless_func!, but nowhere near as dramatic as before.

julia> @benchmark my_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  134.300 μs … 784.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     145.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   147.662 μs ±  14.218 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▁▂  █▅ ▁█ ▁▁▇ ▁▁█ ▁▂▆  ▁▇  ▁▆   ▁▅    ▅    ▁    ▁           ▃
  █▅███▇██▇██▇███████▇███▇███▇████▇▇██▆▆▇▇█▅▇▇▆█▄▂▆▅█▄▆▆▆▅▇▄▃▄▄ █
  134 μs        Histogram: log(frequency) by time        181 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark my_useless_func!($useless, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  125.300 μs … 452.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     135.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   139.672 μs ±  22.715 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅█▃▄▄
  ▆▃█████▄▅▄▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▁▂▂▁▂ ▃
  125 μs           Histogram: frequency by time          269 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.11.7
Commit f2b3dbda30 (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900HX
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)


This strange outcome is highly platform-specific.

Have you compared the @code_llvm of the eachindex(x) and Base.OneTo(100) versions of my_useless_func! where there’s the 2.6x difference? If anything is on the compiler’s end, it’d be there. Otherwise, I’d chalk it up to stumbling into an obscure condition that made your processor happier.
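
A minimal sketch of such a comparison, capturing the optimized IR of both variants as strings so they can be inspected or diffed. The function names are hypothetical, and the sketch assumes only the outer loop range differs between the two versions.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Two copies of my_useless_func! differing only in the outer loop range.
# The function names are hypothetical.
for (name, outer) in ((:useless_eachindex!, :(eachindex(x))),
                      (:useless_oneto!,     :(Base.OneTo(100))))
    @eval function $name(useless, x)
        useless .= 0.0
        s = 0.0
        for i in $outer
            s_in = 0.0
            for j in eachindex(x)
                s_in += log(√(x[j] + i * j))
            end
            s += s_in
            for j in eachindex(useless)
                useless[j] += s_in
            end
        end
        return s
    end
end

# Capture the optimized LLVM IR as a string instead of printing it.
ir_of(f) = sprint(io -> code_llvm(io, f, (Vector{Float64}, Vector{Float64});
                                  debuginfo=:none))

ir_each = ir_of(useless_eachindex!)
ir_oneto = ir_of(useless_oneto!)

# Crude summary: IR size and whether any vector-of-double type appears.
for (label, ir) in (("eachindex", ir_each), ("OneTo(100)", ir_oneto))
    println(label, ": ", count(==('\n'), ir), " IR lines, vector types present: ",
            occursin(r"<\d+ x double>", ir))
end
```

Writing the two strings to files, for example with write("eachindex.ll", ir_each), allows an ordinary text diff, which is usually easier to read than two long REPL dumps.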

I am not at all an expert in parsing LLVM, but I see loopexit several times in the useless version but not in the clean version. Reading "LLVM Loop Terminology (and Canonical Forms)" in the LLVM 22.0.0git documentation, I see:

It is also possible that two logical loops share a header, but are considered a single loop by LLVM:
for (int i = 0; i < 128; ++i)
  for (int j = 0; j < 128; ++j)
    body(i,j);
which might be represented in LLVM-IR as follows. Note that there is only a single header and hence just a single loop.

So maybe in the clean version, the compiler turns for i in eachindex(x); for j in eachindex(x) into a single loop, and in the useless version it keeps them as two loops?
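
One crude way to probe this guess is to count block labels in the printed IR of both functions. This is a textual proxy only: it assumes labels such as loopexit and vector.body survive in the code_llvm output, and it does not show how LLVM actually structured the loops.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

function my_func!(useless, x)
    useless .= 0.0
    s = 0.0
    for i in eachindex(x)
        s_in = 0.0
        for j in eachindex(x)
            s_in += log(√(x[j] + i * j))
        end
        s += s_in
    end
    return s
end

function my_useless_func!(useless, x)
    useless .= 0.0
    s = 0.0
    for i in eachindex(x)
        s_in = 0.0
        for j in eachindex(x)
            s_in += log(√(x[j] + i * j))
        end
        s += s_in
        for j in eachindex(useless)
            useless[j] += s_in
        end
    end
    return s
end

# Optimized IR as a string, without debug-info comments.
ir_of(f) = sprint(io -> code_llvm(io, f, (Vector{Float64}, Vector{Float64});
                                  debuginfo=:none))

# Count textual occurrences of two label fragments in each function's IR.
for f in (my_func!, my_useless_func!)
    ir = ir_of(f)
    println(nameof(f), ": loopexit = ", count("loopexit", ir),
            ", vector.body = ", count("vector.body", ir))
end
```

A difference in these counts would be consistent with the two functions getting different loop structure, but confirming the single-loop hypothesis would still require reading the control flow in the IR itself.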

It persists in v1.12.1.
I’m a bit surprised this didn’t attract more attention. Is there an associated GitHub issue?

julia> versioninfo()
Julia Version 1.12.1
Commit ba1e628ee49 (2025-10-17 13:02 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × 13th Gen Intel(R) Core(TM) i7-1365U
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, alderlake)
  GC: Built with stock GC
Threads: 12 default, 1 interactive, 12 GC (on 12 virtual cores)