I was benchmarking different ways of summing the columns of a matrix (one sum per entry of the second dimension), and I got some results that surprised me when doing it with a for loop.
I had thought that, in general, the loop would be faster if the inner loop ran over the rows, so that the matrix is read one column at a time from memory. But in the example below, function f, whose inner loop runs over the columns, is faster than its counterpart g.
I'm guessing this is because it's faster to sweep over the vector res again and again than to compute the sum for each entry one at a time. Any intuition on why this is the case? Does the CPU decide to keep res in cache if you keep looping over it?
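(For reference, the assumption here is that Julia arrays are column-major, so elements within a column sit next to each other in memory; a quick way to see that:)

strides(randn(10000, 100))  # returns (1, 10000): stepping down a column moves 1 element,
                            # stepping along a row jumps 10000 elements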
julia> using BenchmarkTools
julia> X = randn(10000, 100); Xt = Matrix(X');
julia> function f(x)
           n, m = size(x)
           res = zeros(m)
           # i outer, j inner: strided (row-wise) reads of x, but each write goes to a different res[j]
           for i = 1:n, j = 1:m
               res[j] += x[i, j]
           end
           return res
       end;

julia> function g(x)
           n, m = size(x)
           res = zeros(m)
           # j outer, i inner: contiguous (column-wise) reads of x, but every inner iteration writes the same res[j]
           for j = 1:m, i = 1:n
               res[j] += x[i, j]
           end
           return res
       end;

julia> function h(x)
           m, n = size(x)
           res = zeros(m)
           # i outer, j inner: contiguous reads down a column of x (meant for the transposed copy Xt)
           for i = 1:n, j = 1:m
               res[j] += x[j, i]
           end
           return res
       end;

julia> function p(x)
           m, n = size(x)
           res = zeros(m)
           # j outer, i inner: strided reads along a row of x, and every inner iteration writes the same res[j]
           for j = 1:m, i = 1:n
               res[j] += x[j, i]
           end
           return res
       end;
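To be clear, all four functions compute the same thing, the vector of column sums of X; they differ only in the loop order and in whether they read X or the transposed copy Xt. A quick sanity check (not part of the benchmark):

f(X) == g(X) == h(Xt) == p(Xt)  # same result: the same additions happen in the same order
f(X) ≈ vec(sum(X; dims=1))      # matches the built-in column sum up to roundoff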
julia> @benchmark f($X)
BenchmarkTools.Trial: 7641 samples with 1 evaluation.
Range (min … max):  535.404 μs …   1.974 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     612.767 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   651.517 μs ± 124.330 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  535 μs          Histogram: log(frequency) by time          1.22 ms <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> @benchmark g($X)
BenchmarkTools.Trial: 5624 samples with 1 evaluation.
Range (min … max):  852.255 μs …   1.672 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     855.119 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   886.767 μs ±  66.944 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  852 μs          Histogram: log(frequency) by time          1.19 ms <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> @benchmark h($Xt)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  389.913 μs …   3.518 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     449.312 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   492.996 μs ± 204.491 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  390 μs          Histogram: log(frequency) by time          1.65 ms <
Memory estimate: 896 bytes, allocs estimate: 1.
julia> @benchmark p($Xt)
BenchmarkTools.Trial: 3554 samples with 1 evaluation.
Range (min … max):  1.236 ms …   6.085 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.278 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   1.402 ms ± 428.580 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  1.24 ms          Histogram: log(frequency) by time          3.29 ms <
Memory estimate: 896 bytes, allocs estimate: 1.
I find that g is faster on my computer.
The performance difference might be that your CPU is able to run fewer instructions in parallel when they all write to the same index of res.
Modern CPUs are superscalar, running multiple instructions at a time per core. In g, consecutive iterations write to the same res[j], so each addition has to wait for the previous one to finish, and the CPU cannot overlap multiple iterations. That is not the case for f or h, where consecutive iterations write to different elements of res.
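To illustrate, here is a hypothetical variant of g (g_unrolled is just an illustrative name): each column is accumulated into four independent partial sums, so consecutive additions no longer wait on each other and the core can overlap them.

function g_unrolled(x)
    n, m = size(x)
    res = zeros(m)
    for j = 1:m
        # four independent accumulators break the serial dependency chain
        s1 = s2 = s3 = s4 = 0.0
        i = 1
        while i + 3 <= n
            s1 += x[i, j]
            s2 += x[i + 1, j]
            s3 += x[i + 2, j]
            s4 += x[i + 3, j]
            i += 4
        end
        # remainder when n is not a multiple of 4
        while i <= n
            s1 += x[i, j]
            i += 1
        end
        res[j] = (s1 + s2) + (s3 + s4)
    end
    return res
end

The result can differ from g in the last bits because the additions are reordered, which is also why the compiler will not reassociate them on its own unless you allow it, for example with the SIMD annotation mentioned below.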
You can get even faster than this if you enable SIMD.
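For example, here is a sketch of the column-first loop with the inner accumulation annotated (@inbounds and @simd are standard Julia macros; g_simd is just an illustrative name): @simd gives the compiler permission to reorder the floating-point additions, so it can vectorize the reduction and keep several partial sums in registers.

function g_simd(x)
    n, m = size(x)
    res = zeros(m)
    for j = 1:m
        s = 0.0
        # @inbounds drops bounds checks; @simd allows reassociating the additions
        @inbounds @simd for i = 1:n
            s += x[i, j]
        end
        res[j] = s
    end
    return res
end

With that annotation the dependency chain in g largely disappears, since the vectorized reduction is itself a set of independent partial sums.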
It's so odd that f is faster than h. The two are the same except for the order in which they access x, which should be much more favorable in h since it reads x in column-major order.