I was benchmarking different ways of summing the columns of a matrix (one sum per entry of the second dimension), and I got some results that surprised me when doing it with a for loop.
I had thought that, in general, the loop would be faster if the inner loop ran over the rows, so that the matrix is read one column at a time from memory. But in the example below, function f, whose inner loop runs over the columns, is faster than its counterpart g.
I'm guessing this is because it's faster to sweep over the vector res again and again than to compute the sum for each entry one at a time. Any intuition on why this is the case? Does the CPU decide to keep res in cache if you keep looping over it?
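(For reference, the assumption here is that Julia arrays are column-major, so elements within a column sit next to each other in memory; a quick way to see that:)

strides(randn(10000, 100))  # returns (1, 10000): stepping down a column moves 1 element,
                            # stepping along a row jumps 10000 elements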
julia> using BenchmarkTools
julia> X = randn(10000, 100); Xt = Matrix(X');
julia> function f(x)
           n, m = size(x)
           res = zeros(m)
           # i outer, j inner: strided (row-wise) reads of x, but each write goes to a different res[j]
           for i = 1:n, j = 1:m
               res[j] += x[i, j]
           end
           return res
       end;

julia> function g(x)
           n, m = size(x)
           res = zeros(m)
           # j outer, i inner: contiguous (column-wise) reads of x, but every inner iteration writes the same res[j]
           for j = 1:m, i = 1:n
               res[j] += x[i, j]
           end
           return res
       end;

julia> function h(x)
           m, n = size(x)
           res = zeros(m)
           # i outer, j inner: contiguous reads down a column of x (meant for the transposed copy Xt)
           for i = 1:n, j = 1:m
               res[j] += x[j, i]
           end
           return res
       end;

julia> function p(x)
           m, n = size(x)
           res = zeros(m)
           # j outer, i inner: strided reads along a row of x, and every inner iteration writes the same res[j]
           for j = 1:m, i = 1:n
               res[j] += x[j, i]
           end
           return res
       end;
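To be clear, all four functions compute the same thing, the vector of column sums of X; they differ only in the loop order and in whether they read X or the transposed copy Xt. A quick sanity check (not part of the benchmark):

f(X) == g(X) == h(Xt) == p(Xt)  # same result: the same additions happen in the same order
f(X) ≈ vec(sum(X; dims=1))      # matches the built-in column sum up to roundoff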
julia> @benchmark f($X)
BenchmarkTools.Trial: 7641 samples with 1 evaluation.
Range (min … max):  535.404 μs …   1.974 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     612.767 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   651.517 μs ± 124.330 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  535 μs          Histogram: log(frequency) by time          1.22 ms <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> @benchmark g($X)
BenchmarkTools.Trial: 5624 samples with 1 evaluation.
Range (min … max):  852.255 μs …   1.672 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     855.119 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   886.767 μs ±  66.944 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  852 μs          Histogram: log(frequency) by time          1.19 ms <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> @benchmark h($Xt)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  389.913 μs …   3.518 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     449.312 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   492.996 μs ± 204.491 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  390 μs          Histogram: log(frequency) by time          1.65 ms <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> @benchmark p($Xt)
BenchmarkTools.Trial: 3554 samples with 1 evaluation.
Range (min … max):  1.236 ms …   6.085 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.278 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   1.402 ms ± 428.580 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  1.24 ms          Histogram: log(frequency) by time          3.29 ms <
Memory estimate: 896 bytes, allocs estimate: 1.
I find that g is faster on my computer.
The performance difference might be that your CPU is able to run fewer instructions in parallel when they all write to the same index of res.
Modern CPUs are superscalar, running multiple instructions at a time per core. In g, consecutive iterations write to the same res[j], so each addition has to wait for the previous one to finish, and the CPU cannot overlap multiple iterations. That is not the case for f or h, where consecutive iterations write to different elements of res.
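To illustrate, here is a hypothetical variant of g (g_unrolled is just an illustrative name): each column is accumulated into four independent partial sums, so consecutive additions no longer wait on each other and the core can overlap them.

function g_unrolled(x)
    n, m = size(x)
    res = zeros(m)
    for j = 1:m
        # four independent accumulators break the serial dependency chain
        s1 = s2 = s3 = s4 = 0.0
        i = 1
        while i + 3 <= n
            s1 += x[i, j]
            s2 += x[i + 1, j]
            s3 += x[i + 2, j]
            s4 += x[i + 3, j]
            i += 4
        end
        # remainder when n is not a multiple of 4
        while i <= n
            s1 += x[i, j]
            i += 1
        end
        res[j] = (s1 + s2) + (s3 + s4)
    end
    return res
end

The result can differ from g in the last bits because the additions are reordered, which is also why the compiler will not reassociate them on its own unless you allow it, for example with the SIMD annotation mentioned below.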
You can get even faster than this if you enable SIMD.
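For example, here is a sketch of the column-first loop with the inner accumulation annotated (@inbounds and @simd are standard Julia macros; g_simd is just an illustrative name): @simd gives the compiler permission to reorder the floating-point additions, so it can vectorize the reduction and keep several partial sums in registers.

function g_simd(x)
    n, m = size(x)
    res = zeros(m)
    for j = 1:m
        s = 0.0
        # @inbounds drops bounds checks; @simd allows reassociating the additions
        @inbounds @simd for i = 1:n
            s += x[i, j]
        end
        res[j] = s
    end
    return res
end

With that annotation the dependency chain in g largely disappears, since the vectorized reduction is itself a set of independent partial sums.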
It's so odd that f is faster than h. The two are the same except for the order in which they access x, which should be much more favorable in h since it reads x in column-major order.