The compiler has more information about a vector than a matrix: it knows the second dimension is 1. This can yield better performance in cases such as:
julia> x = rand(128); X = reshape(x, (length(x),1));
julia> typeof.((x, X))
(Array{Float64,1}, Array{Float64,2})
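The key difference is that the number of dimensions is part of the array's type, so for a `Vector` every trailing extent is statically 1, while for a `Matrix` the second extent is only known at runtime. A quick check:

```julia
x = rand(4)                        # Vector{Float64}
X = reshape(x, (length(x), 1))     # Matrix{Float64}

@assert ndims(typeof(x)) == 1      # dimensionality lives in the type itself
@assert size(x, 2) == 1            # trailing dimensions of a Vector are always 1
@assert size(X, 2) == 1            # true here too, but only discoverable at runtime
```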
julia> A = rand(32,128); B = similar(A);
julia> @benchmark @. $B = $A + $x'
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 773.423 ns (0.00% GC)
median time: 784.433 ns (0.00% GC)
mean time: 792.276 ns (0.00% GC)
maximum time: 1.226 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 104
julia> @benchmark @. $B = $A + $X'
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 805.562 ns (0.00% GC)
median time: 816.124 ns (0.00% GC)
mean time: 821.296 ns (0.00% GC)
maximum time: 1.146 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 89
With LoopVectorization the difference is more extreme:
julia> using LoopVectorization
julia> @benchmark @avx @. $B = $A + $x'
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 592.760 ns (0.00% GC)
median time: 594.268 ns (0.00% GC)
mean time: 597.365 ns (0.00% GC)
maximum time: 800.821 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 179
julia> @benchmark @avx @. $B = $A + $X'
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.022 μs (0.00% GC)
median time: 1.027 μs (0.00% GC)
mean time: 1.028 μs (0.00% GC)
maximum time: 3.066 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
LLVM will often do “multiversioning”, which means creating extra loops for special cases that can be optimized more, such as the matrix being Nx1. When LLVM does create code optimized for that special case, the performance difference might not be so large. LoopVectorization doesn’t multiversion, hence we get the larger difference there.
But you have to pick and choose the cases that perform well.
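As an illustrative sketch only (not LLVM's actual output, and the function name is made up), multiversioning amounts to branching on a runtime property and running a more optimizable loop when it holds:

```julia
# Hypothetical hand-written analogue of what multiversioning does for
# the broadcast `@. B = A + X'`: branch on a runtime size, take a
# better-optimizable loop in the special case.
function add_transposed!(B::Matrix, A::Matrix, X::Matrix)
    if size(X, 2) == 1                     # special case: X is really a column vector
        x = vec(X)                         # reinterpret as a Vector, no copy
        for j in axes(A, 2)
            xj = x[j]                      # hoist the loop-invariant load
            @inbounds @simd for i in axes(A, 1)
                B[i, j] = A[i, j] + xj
            end
        end
    else                                   # general path: index X' as a true matrix
        @inbounds for j in axes(A, 2), i in axes(A, 1)
            B[i, j] = A[i, j] + X[j, i]
        end
    end
    return B
end
```

For a 128×1 `X` this takes the fast path and matches `@. B = A + X'`; for a genuinely two-dimensional `C` it falls through to the general loop.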
I think LLVM will be faster than LoopVectorization for this next example if you don’t have AVX512 (I’m working on something that should let me address this in the next few weeks!), but if you do:
julia> C = rand(128,32);
julia> @benchmark @. $B = $A + $C'
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.534 μs (0.00% GC)
median time: 2.543 μs (0.00% GC)
mean time: 2.546 μs (0.00% GC)
maximum time: 3.700 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark @avx @. $B = $A + $C'
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.299 μs (0.00% GC)
median time: 1.308 μs (0.00% GC)
mean time: 1.309 μs (0.00% GC)
maximum time: 3.969 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
My point is that there are a lot of trade-offs in the kinds of cases you optimize for.
Giving the compiler more information, e.g. by using a Vector instead of a matrix where one axis equals 1, will help it.
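If you're starting from an Nx1 matrix, `vec` gives you that Vector without copying, so downstream code gets the static information for free:

```julia
# `vec` reshapes an N×1 Matrix into a Vector sharing the same memory,
# so the compiler statically knows the trailing dimension is 1.
X = reshape(rand(128), (128, 1))   # Matrix{Float64}
x = vec(X)                         # Vector{Float64}; no copy
@assert x isa Vector{Float64}
@assert pointer(x) == pointer(X)   # same underlying buffer
```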