1d arrays vs 1-column matrices

The compiler has more information about a vector than about a matrix: for a vector, it knows statically that the second dimension is 1. This can yield better performance in cases such as:

julia> x = rand(128); X = reshape(x, (length(x),1));

julia> typeof.((x, X))
(Array{Float64,1}, Array{Float64,2})
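That type-level distinction carries through to the adjoints used in the broadcasts below: `x'` is the adjoint of a vector, while `X'` is the adjoint of a matrix (a quick check; the exact printed type names vary by Julia version):

```julia
using LinearAlgebra  # Adjoint lives here

x = rand(128)
X = reshape(x, (length(x), 1))

# x' wraps a Vector, X' wraps a Matrix, so downstream code
# sees different amounts of static shape information.
x' isa Adjoint{Float64, Vector{Float64}}  # true
X' isa Adjoint{Float64, Matrix{Float64}}  # true
```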

julia> A = rand(32,128); B = similar(A);

julia> @benchmark @. $B = $A + $x'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     773.423 ns (0.00% GC)
  median time:      784.433 ns (0.00% GC)
  mean time:        792.276 ns (0.00% GC)
  maximum time:     1.226 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     104

julia> @benchmark @. $B = $A + $X'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     805.562 ns (0.00% GC)
  median time:      816.124 ns (0.00% GC)
  mean time:        821.296 ns (0.00% GC)
  maximum time:     1.146 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     89
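For reference, the vector broadcast above amounts to the following explicit loop (a hand-written sketch, not how Base actually lowers it). With the N×1 matrix, the load `x[j]` becomes `X[j, 1]`, where the compiler cannot statically rule out a second column:

```julia
# Sketch: @. B = A + x' computes B[i, j] = A[i, j] + x[j],
# i.e. x is broadcast along the rows of A.
function add_row!(B, A, x)
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        B[i, j] = A[i, j] + x[j]
    end
    return B
end
```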

With LoopVectorization, the difference is more extreme:

julia> using LoopVectorization

julia> @benchmark @avx @. $B = $A + $x'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     592.760 ns (0.00% GC)
  median time:      594.268 ns (0.00% GC)
  mean time:        597.365 ns (0.00% GC)
  maximum time:     800.821 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     179

julia> @benchmark @avx @. $B = $A + $X'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.022 μs (0.00% GC)
  median time:      1.027 μs (0.00% GC)
  mean time:        1.028 μs (0.00% GC)
  maximum time:     3.066 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

LLVM will often do “multiversioning”, meaning it creates extra versions of a loop for special cases that can be optimized further, such as the matrix being N×1. When LLVM does create code optimized for that special case, the performance difference might not be so large.
LoopVectorization doesn’t multiversion, hence the larger difference there.

But you have to pick and choose which cases to optimize for.
I think LLVM will be faster than LoopVectorization for this next example if you don’t have AVX512 (I’m working on something that should let me address this in the next few weeks!), but if you do:

julia> C = rand(128,32);

julia> @benchmark @. $B = $A + $C'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.534 μs (0.00% GC)
  median time:      2.543 μs (0.00% GC)
  mean time:        2.546 μs (0.00% GC)
  maximum time:     3.700 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark @avx @. $B = $A + $C'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.299 μs (0.00% GC)
  median time:      1.308 μs (0.00% GC)
  mean time:        1.309 μs (0.00% GC)
  maximum time:     3.969 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

My point is that there are a lot of trade-offs in the kinds of cases you optimize for.
Giving the compiler more information, e.g. by using a Vector instead of a matrix whose second axis has length 1, will help it.
