Faster sum loop when looping through vector multiple times than once

I was benchmarking different ways of summing a matrix by its second dimension and I got some surprising results (for me) when doing it through a for loop.

I had thought that in general the loop would be faster if the inner loop was going through the rows, so it loads the matrix one column at a time in memory. But given the example below, function f, which loops through columns first, is faster than its counterpart g.

I’m guessing this is because it’s faster to iterate through the vector res over and over again, than calculating the sum for each entry one at a time. Any intuition on why this is the case? Does the CPU guess to keep res in cache if you’re looping through it repeatedly?

julia> using BenchmarkTools
julia> X = randn(10000, 100); Xt = Matrix(X');

julia> function f(x)
           n, m = size(x)
           res = zeros(m)
           for i = 1:n, j = 1:m
               res[j] += x[i, j]
           end
           return res
       end;

julia> function g(x)
           n, m = size(x)
           res = zeros(m)
           for j = 1:m, i = 1:n
               res[j] += x[i, j]
           end
           return res
       end;

julia> function h(x)
           m, n = size(x)
           res = zeros(m)
           for i = 1:n, j = 1:m
               res[j] += x[j, i]
           end
           return res
       end;

julia> function p(x)
           m, n = size(x)
           res = zeros(m)
           for j = 1:m, i = 1:n
               res[j] += x[j, i]
           end
           return res
       end;

julia> @benchmark f($X)
BenchmarkTools.Trial: 7641 samples with 1 evaluation.
 Range (min … max):  535.404 ΞΌs …   1.974 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     612.767 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   651.517 ΞΌs Β± 124.330 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–‚β–ˆβ–ˆβ–‡β–†β–„β–ƒβ–ƒβ–‚β–                                                ▁
  β–‚β–ƒβ–ƒβ–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–‡β–…β–†β–†β–†β–‡β–†β–„β–…β–ƒβ–…β–ƒβ–ƒβ–„β–…β–…β–‚β–„β–„β–…β–„ β–ˆ
  535 ΞΌs        Histogram: log(frequency) by time       1.22 ms <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark g($X)
BenchmarkTools.Trial: 5624 samples with 1 evaluation.
 Range (min … max):  852.255 ΞΌs …  1.672 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     855.119 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   886.767 ΞΌs Β± 66.944 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–‚β–β–ƒβ–β– ▁  ▂▁▃▄▁                                               
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–†β–†β–…β–…β–…β–…β–†β–…β–…β–„β–…β–‡β–…β–…β–„β–„β–„β–‚β–…β–„β–„β–„β–„β–„β–…β–‚β–„β–„β–„β–ƒβ–„β–ƒβ–‚β–‚β–„ β–ˆ
  852 ΞΌs        Histogram: log(frequency) by time      1.19 ms <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark h($Xt)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  389.913 ΞΌs …   3.518 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     449.312 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   492.996 ΞΌs Β± 204.491 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ƒβ–ˆβ–‡β–„β–ƒβ–‚β–β–β–β–β–                                                  β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–†β–†β–†β–…β–ƒβ–…β–„β–„β–…β–„β–…β–„β–…β–†β–„β–…β–„β–„β–…β–…β–„β–…β–…β–„β–…β–„β–„β–„β–…β–…β–ƒβ–†β–…β–…β–β–„β–…β–„β–ƒβ–…β–„β–ƒ β–ˆ
  390 ΞΌs        Histogram: log(frequency) by time       1.65 ms <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark p($Xt)
BenchmarkTools.Trial: 3554 samples with 1 evaluation.
 Range (min … max):  1.236 ms …   6.085 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.278 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.402 ms Β± 428.580 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–‡β–†β–„β–ƒβ–‚β–‚β–β–‚β–β–β–                                                ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–†β–‡β–†β–†β–‡β–†β–†β–…β–…β–†β–…β–…β–…β–β–†β–†β–…β–…β–…β–…β–…β–†β–β–„β–…β–†β–…β–†β–…β–„β–…β–„β–ƒβ–†β–ƒβ–ƒβ–β–ƒβ–ƒβ–β–β–… β–ˆ
  1.24 ms      Histogram: log(frequency) by time      3.29 ms <

 Memory estimate: 896 bytes, allocs estimate: 1.

I find that g is faster on my computer.
The difference between the performance might be that your CPU is able to run fewer instructions in parallel if they all write to the same index of f.
Modern CPUs are superscalar, running multiple instructions at a time per core. In g, you write to the same res[j] multiple iterations after each other. This prevents the CPU from doing multiple iterations at a time, since each iteration has to wait for the last one. This is not the case for f or h.

You can get even faster than this if you enable SIMD.

Interesting, for me it’s f and h by a wide margin:

julia> @benchmark f($X)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  297.794 ΞΌs … 753.505 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     301.170 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   304.258 ΞΌs Β±  12.985 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–…  β–ˆ                                                       
  β–‚β–‚β–‚β–…β–ˆβ–‡β–ˆβ–ˆβ–‡β–„β–…β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  298 ΞΌs           Histogram: frequency by time          324 ΞΌs <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark g($X)
BenchmarkTools.Trial: 7704 samples with 1 evaluation.
 Range (min … max):  646.403 ΞΌs … 769.385 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     646.844 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   647.623 ΞΌs Β±   3.835 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‡β–ˆβ–…β–†β–‡β–…β–‚β–       ▅▄▂▃▃▂▂▂▂▃▃▂▂▂▂▁▁▁                             β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–…β–†β–‡β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–†β–‡β–‡β–‡β–‡β–†β–„β–„β–…β–…β–…β–…β–†β–„β–…β–„β–…β–„β–ƒ β–ˆ
  646 ΞΌs        Histogram: log(frequency) by time        653 ΞΌs <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark h($Xt)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  471.001 ΞΌs … 518.501 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     471.973 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   472.637 ΞΌs Β±   2.185 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–β–†β–†β–ƒβ–β–„β–‡β–ˆβ–‡β–„β–‚  ▁▂▁  ▃▄▃▁             ▁▁▂▂▂▂▂▁▁                  β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–†β–†β–†β–‡β–†β–†β–…β–…β–…β–…β–„β–ƒβ–„ β–ˆ
  471 ΞΌs        Histogram: log(frequency) by time        479 ΞΌs <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark p($Xt)
BenchmarkTools.Trial: 7488 samples with 1 evaluation.
 Range (min … max):  658.185 ΞΌs … 1.202 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     663.926 ΞΌs             β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   666.341 ΞΌs Β± 9.865 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–„β–ˆβ–‡β–…β–ƒβ–β–‚β–β–‚β–„β–ƒβ–‚β–                                           ▁
  β–ˆβ–‡β–…β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–†β–†β–†β–†β–„β–…β–β–…β–…β–…β–„β–…β–†β–…β–ƒβ–…β–…β–…β–„β–„β–„β–β–„β–β–ƒβ–ƒβ–ƒβ–β–ƒβ–„β–ƒβ–ƒβ–β–„β–† β–ˆ
  658 ΞΌs       Histogram: log(frequency) by time       715 ΞΌs <

 Memory estimate: 896 bytes, allocs estimate: 1.
julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a* (2023-08-24 14:43 UTC)

Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 12 Γ— AMD Ryzen 5 5600X 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 6 on 12 virtual cores
Environment:
  JULIA_NUM_THREADS = 6

It’s h for me:

julia> @benchmark f($X)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  244.771 ΞΌs … 798.871 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     255.671 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   258.203 ΞΌs Β±  15.825 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–β–ƒβ–…β–‡β–ˆβ–ˆβ–‡β–‡β–…β–…β–„β–‚β–‚β–β– ▁                                             β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–‡β–‡β–†β–†β–…β–…β–„β–„β–†β–β–β–ƒβ–β–β–β–β–„β–„β–ƒβ–β–ƒβ–„β–β–β–β–β–β–β–β–ƒβ–„β–…β–„β–…β–… β–ˆ
  245 ΞΌs        Histogram: log(frequency) by time        360 ΞΌs <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark g($X)
BenchmarkTools.Trial: 8870 samples with 1 evaluation.
 Range (min … max):  560.061 ΞΌs … 600.742 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     560.452 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   562.089 ΞΌs Β±   3.415 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ƒβ–ˆβ–„β–‚β–     ▂▁▃▁▁▁       ▁▁▁▁▁▁                                 ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–†β–…β–†β–…β–†β–…β–†β–†β–…β–†β–†β–…β–†β–…β–†β–†β–†β–„β–…β–…β–„β–„β–ƒβ–… β–ˆ
  560 ΞΌs        Histogram: log(frequency) by time        575 ΞΌs <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark h($Xt)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  215.100 ΞΌs … 317.981 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     219.281 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   220.806 ΞΌs Β±   6.007 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–„β–†β–ˆβ–ˆβ–…β–‚β–‚β–ƒβ–ƒβ–ƒβ–‚β–‚β–β–β–                                           β–‚
  β–„β–‚β–ƒβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–†β–‡β–†β–‡β–…β–†β–†β–†β–…β–„β–…β–…β–„β–ƒβ–…β–…β–„β–…β–„β–…β–…β–ƒβ–„β–ƒβ–ƒβ–„β–„β–…β–…β–†β–‡β–‡β–†β–„β–…β–…β–† β–ˆ
  215 ΞΌs        Histogram: log(frequency) by time        251 ΞΌs <

 Memory estimate: 896 bytes, allocs estimate: 1.

julia> @benchmark p($Xt)
BenchmarkTools.Trial: 8721 samples with 1 evaluation.
 Range (min … max):  565.141 ΞΌs …  1.102 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     568.971 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   571.699 ΞΌs Β± 10.563 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–…β–‡β–…β–‚β–…β–ˆβ–†β–„β–…β–…β–…β–…β–ƒβ–‚β–ƒβ–ƒβ–‚β–‚β–β–β–β–β–β–                                     β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–†β–‡β–†β–†β–†β–‡β–†β–…β–†β–…β–†β–†β–„β–†β–„β–…β–†β–‡β–‡β–‡β–†β–‡β–‡β–ˆβ–‡β–‡β–†β–† β–ˆ
  565 ΞΌs        Histogram: log(frequency) by time       604 ΞΌs <

 Memory estimate: 896 bytes, allocs estimate: 1.

It’s always good to keep the machine you’re benchmarking on in mind.


julia> versioninfo()
Julia Version 1.11.0-DEV.421
Commit 5be358fdd0c (2023-09-12 16:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 Γ— AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 34 on 24 virtual cores
Environment:
  JULIA_PKG_USE_CLI_GIT = true

@Jad_Zeitouni , what output do you get when you run versioninfo()?

It’s so odd that f is faster than h. The two are the same except for the order in which they access x, which should be much favorable in h since it is column-major order.

Here’s my version info

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 Γ— Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 16 virtual cores
1 Like