A bit late to the party, but I like optimizing nested loops in Julia. First, let's make a version that actually does something distinct on each iteration, rather than overwriting the same `cc`:
```julia
aa = rand(Float32, 100)
bb = rand(Float32, 100)
cc = zeros(Float32, 100)

function test!(cc, aa, bb) # Normally the first argument is the mutated one
    for i = 1:100
        for j = 1:100
            for k = 1:100
                cc[i] += aa[j] * bb[k]
            end
        end
    end
    return cc # Optional, but doesn't cost anything
end
```
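As a quick sanity check on what this loop actually computes (not from the original post, just an aside): every `cc[i]` accumulates the same double sum over `j` and `k`, which factors algebraically into `sum(aa) * sum(bb)`. A minimal self-contained sketch:

```julia
# The triple loop adds aa[j]*bb[k] to cc[i] for every (j, k), and that
# double sum factors: sum_j sum_k aa[j]*bb[k] == sum(aa) * sum(bb)
aa = rand(Float32, 100)
bb = rand(Float32, 100)
cc = zeros(Float32, 100)

for i = 1:100, j = 1:100, k = 1:100
    cc[i] += aa[j] * bb[k]
end

expected = sum(aa) * sum(bb)
# Loose tolerance: Float32 accumulation over 10_000 terms drifts a bit
@assert all(x -> isapprox(x, expected; rtol = 1e-3), cc)
```

This also means the summation order is free to change, which is exactly the latitude SIMD and threading transformations need.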
For benchmarking, I strongly recommend something that gives you a histogram rather than a single number. For example:
```
julia> using BenchmarkTools

julia> @benchmark test!($cc, $aa, $bb)
BenchmarkTools.Trial: 1757 samples with 1 evaluation.
 Range (min … max):  2.435 ms … 5.037 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.671 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.836 ms ± 445.603 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▅▇▆▄▆▂▁▁
  ██████████▇▅▆▄▅▄▅▄▄▄▃▃▃▃▄▃▃▃▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▁▂▂▂▂ ▄
  2.44 ms        Histogram: frequency by time        4.54 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Note that if benchmarking in global scope (like the REPL), you need to use `$` to interpolate global variables into the benchmark.
This isn't terrible, and there are no allocations, which is good, but one can get much faster by using the CPU's SIMD vector registers and instructions, which for this example is easy with the amazing LoopVectorization.jl:
```julia
using LoopVectorization

function test_lv!(cc, aa, bb) # Normally the first argument is the mutated one
    @turbo for i = 1:100
        for j = 1:100
            for k = 1:100
                cc[i] += aa[j] * bb[k]
            end
        end
    end
    return cc # Optional, but doesn't cost anything
end
```
which gives us
```
julia> @benchmark test_lv!($cc, $aa, $bb)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  106.252 μs … 453.916 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     106.263 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   110.554 μs ±  16.103 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁ ▃ ▃ ▃ ▁                                                   ▁
  ██▄▄█▆▇▅▅▆▅█▆▆▄▅▆▅▄▃█▅▆▅▅▅▅▃▄▄▄▅▁▁▃▃▄▃▃▄▁▁▁▃▁▁▃▃▃▁▄▃▁▄▁▁▁▃▁▁█ █
  106 μs        Histogram: log(frequency) by time        206 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
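The "20x" figure follows directly from the two medians reported above; a quick arithmetic check (values copied from the benchmark output):

```julia
# Median times from the two @benchmark runs above
baseline_ms = 2.671        # test!    median, in ms
turbo_ms    = 106.263e-3   # test_lv! median (106.263 μs), in ms

speedup = baseline_ms / turbo_ms
@assert speedup > 20   # roughly 25x on this machine
```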
That's more than a 20x speedup, and it would likely be even better on a computer with AVX512 (I only have AVX2). That's a bit more of a speedup than is typical for LV on (e.g.) `Float64`s, but you can fit twice as many `Float32`s in a given vector register, so SIMD vectorization can matter more for smaller number types.
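To make the register-width point concrete, here is a small sketch; the 256-bit width assumes AVX2 ymm registers, as on the machine in this post (AVX512 zmm registers would be 512 bits):

```julia
register_bits = 256  # AVX2 ymm register width (assumed; AVX512 = 512)

# How many elements of type T fit in one vector register
lanes(T) = register_bits ÷ (8 * sizeof(T))

@assert lanes(Float32) == 8              # eight Float32s per register
@assert lanes(Float64) == 4              # four Float64s per register
@assert lanes(Float32) == 2 * lanes(Float64)
```

So each SIMD instruction processes twice as many `Float32`s as `Float64`s, which is where the extra headroom for smaller number types comes from.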
Finally, since Matlab tries to multithread by default, why don't we do the same to level the playing field? It just takes an extra `t` in the macro name:
```julia
function test_lvt!(cc, aa, bb) # Normally the first argument is the mutated one
    @tturbo for i = 1:100
        for j = 1:100
            for k = 1:100
                cc[i] += aa[j] * bb[k]
            end
        end
    end
    return cc # Optional, but doesn't cost anything
end
```
This gives about another 4x speedup (on a 4-core CPU) despite the small problem size, thanks to the very lightweight threading (from Polyester.jl) used by `@tturbo`.
```
julia> @benchmark test_lvt!($cc, $aa, $bb)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.804 μs … 96.401 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.919 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.389 μs ±  2.680 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▁ ▁                                                       ▁
  ███▆▅▇▇▇█▆▅▄▆▄▅▄▄▄▃▃▁▁▃▄▄▄▁▄▅▅▅▆▆▇▆▆▆▆▄▅▅▆▄▅▄▄▃▃▄▄▄▅▆▄▄▄▄▅▅ █
  26.8 μs       Histogram: log(frequency) by time       40 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
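One caveat worth adding: `@tturbo` (via Polyester.jl) runs on Julia's own thread pool, so to actually see the multithreaded speedup you need to start Julia with threads enabled, e.g. `julia -t auto` or by setting the `JULIA_NUM_THREADS` environment variable. You can check what you got:

```julia
# If this prints 1, @tturbo has no extra threads to distribute work to
# and will perform about the same as @turbo.
println(Threads.nthreads())
```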