# [ANN] LoopVectorization

I may be misreading the code, but it appears the comparison is never made with OpenBLAS. I think it would be of general interest to Julia programmers to see how that compares with the other options. Would it be possible to add it?

I am wondering why my measurements ([ANN] LoopVectorization) are inconsistent with the first figure at the top in the OP.
The AVX-enabled matrix-matrix multiplication only beats OpenBLAS for matrix sizes below 32.
For matrix sizes bigger, OpenBLAS beats the AVX code handily.

Is it just my particular architecture (only AVX), or is there something else going on? In particular, it is unclear whether the intrinsic Fortran is BLAS or MKL or neither of these two.

Maybe you’re using multithreaded BLAS? Earlier it was mentioned that all the benchmarks were done single-threaded.

You are right, I should have posted my code. However, as you can see I did set the number of threads.

``````module mlatest1
using Test
using BenchmarkTools
using LoopVectorization
using LinearAlgebra

function gemmavx!(C, A, B)
M, N = size(C); K = size(B,1)
C .= 0
@avx for n ∈ 1:N, k ∈ 1:K
Bkn = B[k,n]
for m ∈ 1:M
C[m,n] += A[m,k] * Bkn
end
end
return C
end

function gemmsimd!(C, A, B)
M, N = size(C); K = size(B,1)
C .= 0
@inbounds for n ∈ 1:N, k ∈ 1:K
Bkn = B[k,n]
@simd for m ∈ 1:M
C[m,n] += A[m,k] * Bkn
end
end
return C
end

function gemm!(C, A, B)
M, N = size(C); K = size(B,1)
C .= 0
for n ∈ 1:N, k ∈ 1:K
Bkn = B[k,n]
@inbounds for m ∈ 1:M
C[m,n] += A[m,k] * Bkn
end
end
return C
end

function gemmblas!(C, A, B)
mul!(C, A, B)
return C
end

function test(n)
println("n = ", n)
C, A, B = rand(n, n), rand(n, n), rand(n, n)
@test norm(gemm!(C, A, B) - A * B) / norm(C) <= 1.0e-9
@test norm(gemmsimd!(C, A, B) - A * B) / norm(C) <= 1.0e-9
@test norm(gemmblas!(C, A, B) - A * B) / norm(C) <= 1.0e-9
@test norm(gemmavx!(C, A, B) - A * B) / norm(C) <= 1.0e-9

println("gemm!")
@btime gemm!(\$C, \$A, \$B)
println("gemmsimd!")
@btime gemmsimd!(\$C, \$A, \$B)
println("gemmblas!")
@btime gemmblas!(\$C, \$A, \$B)
println("gemmavx!")
@btime gemmavx!(\$C, \$A, \$B)
end
end

using .mlatest1
using LinearAlgebra
mlatest1.test(2^7)
mlatest1.test(2^6)
mlatest1.test(2^5)
mlatest1.test(2^4)
mlatest1.test(2^3)
``````

Results:

``````n = 128
gemm!
769.401 μs (0 allocations: 0 bytes)
gemmsimd!
769.400 μs (0 allocations: 0 bytes)
gemmblas!
212.400 μs (0 allocations: 0 bytes)
gemmavx!
271.500 μs (0 allocations: 0 bytes)
n = 64
gemm!
90.600 μs (0 allocations: 0 bytes)
gemmsimd!
91.200 μs (0 allocations: 0 bytes)
gemmblas!
34.099 μs (0 allocations: 0 bytes)
gemmavx!
35.999 μs (0 allocations: 0 bytes)
n = 32
gemm!
13.999 μs (0 allocations: 0 bytes)
gemmsimd!
14.299 μs (0 allocations: 0 bytes)
gemmblas!
5.083 μs (0 allocations: 0 bytes)
gemmavx!
4.550 μs (0 allocations: 0 bytes)
n = 16
gemm!
2.022 μs (0 allocations: 0 bytes)
gemmsimd!
2.022 μs (0 allocations: 0 bytes)
gemmblas!
1.120 μs (0 allocations: 0 bytes)
gemmavx!
754.703 ns (0 allocations: 0 bytes)
n = 8
gemm!
507.325 ns (0 allocations: 0 bytes)
gemmsimd!
649.390 ns (0 allocations: 0 bytes)
gemmblas!
370.098 ns (0 allocations: 0 bytes)
gemmavx!
147.250 ns (0 allocations: 0 bytes)
``````

Version:

``````Julia Version 1.4.0-DEV.653
Platform Info:
OS: Windows (x86_64-w64-mingw32)
Microsoft Windows [Version 10.0.18363.535]
CPU: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz:
speed         user         nice          sys         idle          irq
#1  2195 MHz   52714515            0     57776921    298672296     11381156  ticks
#2  2195 MHz   41877640            0     21242015    346043796      4266359  ticks
#3  2195 MHz   68650078            0     33686562    306826812      1459062  ticks
#4  2195 MHz   45890843            0     30707765    332564843       667515  ticks
#5  2195 MHz   65329281            0     39737968    304096203      1166953  ticks
#6  2195 MHz   41304109            0     11175343    356684000       363546  ticks
#7  2195 MHz   64453953            0     27187625    317521875       842328  ticks
#8  2195 MHz   51644437            0     16455250    341063765       386968  ticks

Memory: 15.896644592285156 GB (7552.83984375 MB free)
Uptime: 1.034746e6 sec
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, sandybridge)
Environment:
ACLOCAL_PATH = C:\Program Files\Git\mingw64\share\aclocal;C:\Program Files\Git\usr\share\aclocal
HOME = C:\Users\PKrysl
EXEPATH = C:\Program Files\Git
MANPATH = C:\Program Files\Git\mingw64\local\man;C:\Program Files\Git\mingw64\share\man;C:\Program Files\Git\usr\local\man;C:\Program Files\Git\usr\share\man;C:\Program Files\Git\usr\man;C:\Program Files\Git\share\man
TERM = xterm
INFOPATH = C:\Program Files\Git\usr\local\info;C:\Program Files\Git\usr\share\info;C:\Program Files\Git\usr\info;C:\Program Files\Git\share\info
HOMEPATH = \Users\PKrysl
HOMEDRIVE = C:
PATHEXT = .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
PKG_CONFIG_PATH = C:\Program Files\Git\mingw64\lib\pkgconfig;C:\Program Files\Git\mingw64\share\pkgconfig
PSMODULEPATH = C:\Program Files\WindowsPowerShell\Modules;C:\Windows\system32\WindowsPowerShell\v1.0\Modules
PATH = C:\Users\PKrysl\bin;C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\local\bin;C:\Program Files\Git\usr\bin;C:\Program Files\Git\usr\bin;C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\bin;C:\Users\PKrysl\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0;C:\Windows\System32\OpenSSH;C:\Program Files\Git\cmd;C:\Program Files (x86)\Bitvise SSH Client;C:\Program Files\MiKTeX 2.9\miktex\bin\x64;C:\Users\PKrysl\AppData\Local\Microsoft\WindowsApps;C:\Program Files\Git\usr\bin\vendor_perl;C:\Program Files\Git\usr\bin\core_perl
ORIGINAL_PATH = C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\bin;C:\Users\PKrysl\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0;C:\Windows\System32\OpenSSH;C:\Program Files\Git\cmd;C:\Program Files (x86)\Bitvise SSH Client;C:\Program Files\MiKTeX 2.9\miktex\bin\x64;C:\Users\PKrysl\AppData\Local\Microsoft\WindowsApps
``````
2 Likes

I was trying the following code for `dot(x, A, y)` (which should be available in Julia 1.4 I think), to see if `@avx` would improve performance here.

``````using LinearAlgebra, LoopVectorization

dot3(x, A, y) = dot(x, A * y)
function dot3avx(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
(axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
T = typeof(dot3(first(x), first(A), first(y)))
s = zero(T)
i₁ = first(eachindex(x))
x₁ = first(x)
@avx for j in eachindex(y)
yj = y[j]
if !iszero(yj)
for i in eachindex(x)
end
s += dot(temp, yj)
end
end
return s
end
``````

But it gives an error (just from trying to evaluate the function definition):

``````"Don't know how to handle expression:\nif !(iszero(yj))\n    #= In[22]:10 =#\n    temp = zero(adjoint(A[i₁, j]) * x₁)\n    #= In[22]:11 =#\n    for i = eachindex(x)\n        #= In[22]:12 =#\n        temp = vmuladd(adjoint(A[i, j]), x[i], temp)\n    end\n    #= In[22]:14 =#\n    s = s + dot(temp, yj)\nend"

Stacktrace:
[1] push!(::LoopVectorization.LoopSet, ::Expr, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:557
[2] add_block!(::LoopVectorization.LoopSet, ::Expr, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:214
[3] add_loop!(::LoopVectorization.LoopSet, ::Expr, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:265
[5] copyto! at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:6 [inlined]
[6] LoopVectorization.LoopSet(::Expr) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:46
[7] @avx(::LineNumberNode, ::Module, ::Any) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:53
``````

Any suggestions?

Is it possible to take advantage of vectorization with a branch in there?

If I remove the branch I get a different error:

``````dot3(x, A, y) = dot(x, A * y)
function dot3avx(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
(axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
T = typeof(dot3(first(x), first(A), first(y)))
s = zero(T)
i₁ = first(eachindex(x))
x₁ = first(x)
@avx for j in eachindex(y)
yj = y[j]
for i in eachindex(x)
end
s += dot(temp, yj)
end
return s
end
``````
``````DomainError with -1.033113494964142e17:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

Stacktrace:
[1] throw_complex_domainerror(::Symbol, ::Float64) at ./math.jl:32
[2] sqrt at ./math.jl:492 [inlined]
[3] solve_tilesize(::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:172
[4] solve_tilesize(::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::Int64, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:221
[5] solve_tilesize(::LoopVectorization.LoopSet, ::Symbol, ::Symbol, ::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:245
[6] evaluate_cost_tile(::LoopVectorization.LoopSet, ::Array{Symbol,1}) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:317
[7] choose_tile at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:387 [inlined]
[8] choose_order(::LoopVectorization.LoopSet) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:401
[9] lower(::LoopVectorization.LoopSet) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/lowering.jl:809
[10] @avx(::LineNumberNode, ::Module, ::Any) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:53
``````

First of all, thanks to everyone who tried it out and filed an issue, commented here, or sent me a message when something didn’t work.
I’ve fixed a lot of bugs, added the broken examples to the test cases, and created new releases.
LoopVectorization’s latest release is currently 0.2.2.

``````@avx @. dsdt = ((1.0 + tanh(k_s * (u - u_s)) / 2.0 - s) / τ_s
``````

as a test case. I haven’t tried it yet, but it should be fixed (the example from issue 9 is).
Also, if you really want to optimize it, it may be worth experimenting with breaking up the broadcast in a few pieces.
A function like `tanh` is probably going to eat all your registers, causing spills. Evaluating pieces in separate loops may actually be faster.

Try this much simpler version:

``````function dot3avx(x, A, y)
M, N = size(A)
s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
@avx for m ∈ 1:M, n ∈ 1:N
s += x[m] * A[m,n] * y[n]
end
s
end
``````

Your version was written to be very generic.
However, `@avx` will only work with hardware-native types, like `Float32` or `Float64`, so all the careful behavior like `adjoint(A[i,j])` won’t do anything.

Furthermore, defining those new temps are likely to get in the way of optimizations. LoopVectorization does not (currently?) eliminate anything or simplify your expression.
The simpler the expression, the easier it is to model it, and transform it into something else that’s faster. Anything extra you add can only stop it from doing something.

It doesn’t handle branches yet, but it’d be great to handle them eventually.

Currently, if it sees the function `zero`, it assumes it’s a constant, doesn’t inspect the arguments (it should be type stable anyway, so it shouldn’t be anything that changes!) and hoists it out of the loop. In your example, that would cause an undef var error since `j` wasn’t defined yet.

LoopVectorization is currently ignorant of `adjoint`.
Maybe I should add functions like it and `identity` (adjoint would be treated identically to identity). When it is ignorant of a function, it assumes that function is expensive enough that doing fancy stuff will cause more harm than good. Unrolling `+` a bunch of times is okay. Unrolling `f(x) = exp(x) + sin(x)` isn’t a great idea.

Still, it shouldn’t give the second error you got, so I’ll look into it.

All the computers I have at home (where I ran the benchmarks) have avx512, where OpenBLAS performs embarrassingly badly ;).
I did run a few benchmarks, but thought presenting them would misrepresent the performance.
Would probably be a good idea to run them on my laptop.

@baggepinnen
That surprises me. `vmap` doesn’t do anything special.
You can see what `vmap!` actually looks like (ignoring the generated function I used to make it variadic; maybe I should’ve just used splats) on a computer with avx2:

``````julia> LoopVectorization.vmap_quote(2, Float64) # number of arguments, element type
quote
M = length(dest)
vdest = vectorizable(dest)
m = 0
varg_1 = vectorizable(args[1])
varg_2 = vectorizable(args[2])
for _ = 0:M >>> 2 - 1
m += 4
end
if m != M
end
dest
end
``````

(`vmap` just allocates the output, and then calls `vmap!`)
Using `Float32` will of course double the vector sizes.

That is, it just calls the function using `SVec`s as arguments, and hopes code was written generically enough for that to work.

It seems (confirmed via code_native/code_llvm) that `map` isn’t vectorized at all, explaining the performance difference. `broadcast` of course is vectorized.
Seems like an issue with map. Probably unintentional?

MLIR is definitely exciting.
The new Fortran compiler that will be mainlined into LLVM is planning on using MLIR and a Fortran-specific dialect (FIR).

Why did I work on LoopVectorization? My primary motivation is working on improving codegen and performance of ProbabilityModels.jl, which I hope to get in a state worth looking at within the next few months.
One of the things I wanted to be able to do was source to source autodiff for broadcasts and simple loops. This requires that I have some representation of the loops I can work on. If I’m already doing that, it only makes sense to try and make them fast, too.

17 Likes

Looks like a great package!

• My impression from your post is that @avx only supports native float types, yet it appears to support `Int64`, `Int32`, `Int16` as well. Is this expected?
• Is there a possibility of supporting loops which contain a mix of Int and Float types in the future?
• My understanding is that there’s no hope of accelerating non-native types like `Complex{Float64}` or user-defined types like those from DoubleFloats anytime soon. Even so, is there possibility of the `@avx` macro just not doing anything in that case? I.e. not fail? With the current state, it’s harder to write a single method that generalizes to multiple types.

Thanks!

2 Likes

There’s no reason it shouldn’t, but I haven’t been testing it, and some parts weren’t written generically. I’ve since fixed it, but without testing I can’t make promises.

It’s also worth pointing out that AVX (without AVX2 or higher) doesn’t support SIMD integer operations. AVX2 added `Int32` and `Int64` SIMD support, while AVX512BW (everything with AVX512F except Xeon Phi) added `Int8` and `Int16`.

This means that if I wanted to guarantee good performance, I’d have to do a lot more checks. As it is, on a computer with only AVX, it would try to use integer vectors, which would be slow. On a computer with AVX2, it would still try to use vectors of 16 `Int16`, but this would again be slow, because the hardware doesn’t support it.
The assembly would get ugly when tiling, with a mountain of register spills, etc.

I think some examples should already work.
If you have an example that doesn’t, please file an issue.

They can be “feature requests” as well.

These are harder, but far from hopeless.
I think the trick (as it so often is) would be to abuse multiple dispatch.

The plan would be (same idea for `DoubleFloat`):

1. Define `stridedpointer` methods for `::AbstractArray{Complex{T}}` that return something such that `vload` and `vstore` do the right thing (load/store two vectors after shuffling them).
2. Make sure arithmetic functions work correctly for the vector pairs (ie, for complex, one vector for the real numbers, one for the imaginary).
3. Come up with a way to tell the code generation that it ought to load two registers for each of these pair-structs.
4. Find a way to tell code generation that each load eats 2 registers. 4. Also tell it what the actual cost of each operation is on these types; the pattern will be a little different, so it could make poor choices.

“3.” is by far the hardest.

A simple patch fix we can do for now is have the macro check the element types of the arrays (as it already does with a `promote_type` on the `eltype`s), and then have an if/else check on whether it’s supported. If not, it’ll just run the loop as written.

A better fix that would let us use type information while optimizing is to implement the loop macro more like broadcasting: have it transform the loops into expression templates, and then use an `@generated` function to materialize it. That function would have access to the types (like the broadcast), and could thus be much more flexible in terms of what it can handle.

5 Likes

For dealing with things like Complex or Dual numbers, I think a much easier solution might be to transform from Arrays of Structs to Structs of Arrays, I.e. using https://github.com/JuliaArrays/StructArrays.jl

In my very limited understanding, those Structs of arrays would be much easier to deal with for this package.

4 Likes

Very interesting! Thank you!

I have been using Cartesian (to write generic tensor operations) on floats and other types (automatic differentiation). Would you say that LoopVectorization aims at replacing Cartesian? Any comments on how both packages compare?

Here’s a little demo of using StructArrays with LoopVectorization for those interested. This is a demo gemm kernel that does slightly better than BLAS (`LinearAlgebra.mul!`) for Float64 inputs:

``````using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test

const MatrixFInt64 = Union{Matrix{Float64}, Matrix{Int}}

function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
z = zero(eltype(C))
@avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
Cᵢⱼ = z
for k ∈ 1:size(A,2)
Cᵢⱼ += A[i,k] * B[k,j]
end
C[i,j] = Cᵢⱼ
end
end

M, K, N = 50, 51, 52
A = randn(M, K); B = randn(K, N)
C1 = Matrix{Float64}(undef, M, N);
C2 = similar(C1);

@btime mul_avx!(\$C1, \$A, \$B)
@btime mul!(    \$C2, \$A, \$B)
@test C1 ≈ C2
``````

which gives an output

``````BLAS.vendor() = :openblas64
5.594 μs (0 allocations: 0 bytes)
8.741 μs (0 allocations: 0 bytes)
Test Passed
``````

Now, we use StructArrays to extend this to complex matrices:

``````function mul_add_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64, factor=1)
z = zero(eltype(C))
@avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
ΔCᵢⱼ = z
for k ∈ 1:size(A,2)
ΔCᵢⱼ += A[i,k] * B[k,j]
end
C[i,j] += factor * ΔCᵢⱼ
end
end

const StructMatrixComplexFInt64 = Union{StructArray{ComplexF64,2}, StructArray{Complex{Int},2}}

function mul_avx!(C:: StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
mul_avx!(  C.re, A.re, B.re)       # C.re = A.re * B.re
mul_add_avx!(C.re, A.im, B.im, -1) # C.re = C.re - A.im * B.im
mul_avx!(  C.im, A.re, B.im)       # C.im = A.re * B.im
mul_add_avx!(C.im, A.im, B.re)     # C.im = C.im + A.im * B.re
end

A  = StructArray(randn(ComplexF64, M, K));
B  = StructArray(randn(ComplexF64, K, N));
C1 = StructArray(Matrix{ComplexF64}(undef, M, N));
C2 = collect(similar(C1));

@btime mul_avx!(\$C1, \$A, \$B)
@btime mul!(    \$C2, \$(collect(A)), \$(collect(B))) # collect turns the StructArray into a regular Array
@test  collect(C1) ≈ C2
``````

giving

``````  30.696 μs (0 allocations: 0 bytes)
27.582 μs (0 allocations: 0 bytes)
Test Passed
``````

so we go from beating OpenBlas to doing slightly worse when we switch to complex matrices.

It seems that users won’t necessarily be able to use this approach for any given problem, but it works well at least for things like matrix multiply / tensor contractions. On the library side, I think it’s possible that StructArrays could be integrated into the `@avx` macro in such a way that all this works normally on complex and dual numbers, but that’s more speculative.

Edit: With the new v0.3.2 update, the codegen for the `mul_add_avx` function was improved so the timing for the complex struct matrix went from ` 30.696 µs` to `22.988`, now beating OpenBLAS.

10 Likes

Have you tried the 3m method? That should be faster for these sizes.

Curious effect when using `@avx`: This code for the matrix-matrix multiplication has a slight edge without `@avx`

``````function mulCAB1!(C, A, B)
M, N = size(C); K = size(B,1)
C .= 0
for n in 1:N, k in 1:K
Bkn = B[k,n]
for m in 1:M
C[m,n] += A[m,k] * Bkn
end
end
return C
end
``````

over this code

``````function mulCAB2!(C, A, B)
M, N = size(C); K = size(B,1)
z = zero(eltype(C))
for m ∈ 1:M, n ∈ 1:N
Cmn = z
for k ∈ 1:K
Cmn += A[m,k] * B[k,n]
end
C[m,n] = Cmn
end
return C
end
``````

which is faster when the `@avx` macro is used (in front of the outermost loop; obviously either both functions have it or neither one does).
The first version is usually recommended as the fastest in CS books.

Of course, without the `@avx` macro both functions are an order of magnitude slower than when instrumented with the macro. I tested this for matrices of size 64, 32, 16, and 8. BLAS was faster only for size 64.

1 Like

I’ll try to start working on proper documentation, which will hopefully make `@avx` and its performance more transparent.
It’s worth noting here however that LoopVectorization uses a `LoopSet` struct as its internal representation of a set of loops.
This struct contains little more than a(n unordered!) collection of the loops, as well as a collection of operations performed in the loop. These operations contain their dependencies on other operations and on loops.

`@avx` on loops builds a `LoopSet`, which it then converts into an expression.
`@avx` on broadcasts swaps `materialize(!)` for `vmaterialize(!)`, which is a generated function that creates a `LoopSet`, and then converts it into an expression.

The `LoopSet` does not contain information on the original order of the loops, or where the operations were written (order still matters insofar as defining dependencies between operations).

However, the orders the loops can eventually be placed in can be restricted. Writing:

``````    for m ∈ 1:M, n ∈ 1:N
Cmn = z
for k ∈ 1:z
Cmn += A[m,k] * B[k,n]
end
C[m,n] = Cmn
end
``````

Here, because `Cmn` is redefined as a constant for every `m` and `n`, it is taken as a hint to depend on the `m` and `n` loops.
Right now, that forces the `m` and `n` loops to be exterior to where `Cmn` is ultimately placed.
Additionally, `Cmn` is then summed over `k`. That forces `k`; the `k` loop thus depends on `Cmn` and must be internal with respect to `Cmn`'s definition.

Thus it only considers two possible orders (exterior to interior): `m`, `n`, `k` and `n`, `m`, `k`.
One of these is hopefully pretty good.

If instead `Cmn = C[m,n]`, it would not infer the restriction that the `m` and `n` loops must be external to that operation. It will instead assuming that if it reorders, it will keep rereading the correct value from memory.
Similarly, if we want to reoder the loops ourselves, we have to replace the `Cmn = zero(eltype(C))` with a `C .= 0` external to all the loops, and switch to using `Cmn = C[m,n]` instead.

That is a transformation we are capable of, but that I haven’t implemented in `LoopVectorization` yet.

All that said, given the unrestricted version of the loop, what does `@avx` want to do with it?

``````julia> using LoopVectorization: LoopSet, choose_order
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> nkm_q = :(for n in 1:N, k in 1:K
Bkn = B[k,n]
for m in 1:M
C[m,n] += A[m,k] * Bkn
end
end);

julia> nkm_ls = LoopSet(nkm_q);

julia> choose_order(nkm_ls)
([:n, :m, :k], :m, 4, 4)
``````

This happens to be one of the two possible orders of the more restricted version.
Note, the integers will change depending on your CPU architecture. The symbols probably wont.

What this means is that the loop order it choose is (exterior to interior): `n`, `m`, `k`.
The second `m` means that is that SIMD vectors are loaded from the `m` loop. That is, adjacent elements along the `m` axis will be loaded into adjacent SIMD lanes.
If a number doesn’t depend on the `m` loop (like `B[k,n]`) it will insteadl be broadcast across a vector, so each element is equal.
I call this the vectorized loop. The vectorized loop is unrolled by the vector width.

The reason that the `nkm` loop is faster with base Julia/`@simd` is because that allows the `m` loop to be vectorized, while the `mnk` loop cannot.

The `@avx` macro can load SIMD vectors along the `m` axis regardless of the loop order it actually chooses.

The two integers, in this case `(U, T) = 4, 4`, refer to how the loops are unrolled. If `T == -1`, that means the outermost loop is unrolled by `U` (or `U` * vector width, if it is also the vectorized loop).
Otherwise, the outermost loop is unrolled by `T`, and the second outermost loop is unrolled by `U` (or `U` * vector width).
What this unrolling means in this example is that it will:

1. Create `U*T` vectors of `Cmn`.
2. It will start looping over `k`. On each iteration, it will…
3. Load `U` vectors from `A`.
4. for t = 1,…,T, it will load an element from `B`, and update the `U` vectors from `Cmn` that correspond to that `t` (using the `U` vectors from `A`).
5. Finish looping, and store the `Cmn` vectors into `C`.

So on each iteration of `k`, you perform a total of `U+T` loads (`U` vector loads from `A`, `T` broadcasts to fill a register with a single value from `B`). However, you perform a total of `U * T` multiplications and additions, without any of the multiplications-and-additions depending on any others, so that they can all take advantage of out of order execution.
That is, it should be able to keep the CPU’s fma (fused multiply-add) units nearly 100% busy.

This requires a total of `U*T` registers for `Cmn`, `U` registers for `A`, and 1 register for `B`, or `U*T + U + 1`. For `U = T = 4`, that is 21 registers.
Computers with only `AVX2` have 16 floating point registers per core, so `choose_order` should instead return `U = 3; T = 4`, yielding exactly 16.
It has to require less registers than the CPU has, otherwise it would have to do a bunch of extra loads and stores on each `k` iteration, cratering the compute / load ratio.

So, @PetrKryslUCSD, `@avx` does almost the same thing with either loop, because it doesn’t pay too much attention to the actual order you wrote them in. The reason `mulCAB1!` is slower is because `C .= 0` is slower than not doing that, and then within the loops, loading values from `C[m,n]` is also slightly slower than simply assigning 0.

In both cases, it is able to vectorize the `m` loop, while `@simd` is only capable of doing that if the `m` loop is the most interior loop.

Wow, that’s fantastic!
Performance-wise, it is much simpler than trying to get a macro (which doesn’t see types!) to be able to do something fast.

The ratio of performance of your complex to real example was about 5.5.
I just released a new version of LoopVectorization (v"0.3.2") that improved the code generation for your `mul_add_avx!`, which now gives me a factor of about 4.2.
This is still behind the ideal `4`, but much closer!

The problem was that it did not realize it could place the load from `C[i,j]` in `C[i,j] += factor * ΔCᵢⱼ` after the loop, because I had naively been placing all loads prior to any interior loops.
This hurt because it then counted that as an additional occupied register (changing the values of `U` and `T`).
Adding a check to see if it is allowed to delay operations until after the loop, and then doing so if allowed, largely solved the problem.

OpenBLAS and MKL seem to have ratios of <4. IIRC, they default to 4m, so I’m guessing that’s because of BLAS overhead being amortized.

I wouldn’t say it aims to replace Cartesian, which aims to make it easier to write generic loops.
LoopVectorization aims to make loops you have written faster.

EDIT: I updated the opening post with more benchmarks.

14 Likes

Wow, thanks for the quick update! I installed the update and re-ran my benchmarks, finding negligible change in the `Matrix{Float64}`benchmarks ( `5.594 -> 5.603 µs`) but the `StructArray{ComplexF64, 2}` went from `30.696 µs` to `22.988 μs`, so I’m observing a 4.1x increased cost for the complex matrix gemm.

Another thing that’s worth pointing out about the above gemm kernels is that they absolutely smoke `LinearAlgebra.mul!` for matrices of integers. The reason is that OpenBLAS doesn’t do integer matrix multiplication, so `mul!` will fall back on a generic kernel.

Using the same setup code from my post above, I find

``````A = rand(-100:100, M, K); B = rand(-100:100, K, N)
C1 = Matrix{eltype(A)}(undef, M, N);
C2 = similar(C1);

@btime mul_avx!(\$C1, \$A, \$B)
@btime mul!(    \$C2, \$A, \$B)
``````

gives

`````` 28.842 μs (0 allocations: 0 bytes)
65.560 μs (6 allocations: 336 bytes)
``````

and for complex matrices

``````A  = StructArray(rand(-100:100, M, K) + im * rand(-100:100, M, K));
B  = StructArray(rand(-100:100, K, N) + im * rand(-100:100, K, N));
C1 = StructArray(Matrix{eltype(A)}(undef, M, N));
C2 = collect(similar(C1));

@btime mul_avx!(\$C1, \$A, \$B)
@btime mul!(    \$C2, \$(collect(A)), \$(collect(B)))
``````

gives

`````` 116.938 μs (0 allocations: 0 bytes)
232.860 μs (6 allocations: 336 bytes)
``````

I’m sure that with an algorithmic improvement, the advantage for `Complex{Int}` would be even greater.

10 Likes

I tried - naively it turns out - to vectorise a piece of code I’ve long wanted to optimise. A MWE is the following:

``````using BenchmarkTools, LoopVectorization

maxdeg = 20
nbasis = 10_000
dim = 10
basis = rand(1:(maxdeg+1), (dim, nbasis))
coeffs = rand(nbasis)
P = rand(dim, maxdeg+1)

function mvp(P, basis, coeffs::Vector{T}) where {T}
len_c = length(coeffs)
len_P = size(P, 1)
p = zero(T)
for n = 1:len_c
pn = coeffs[n]
for a = 1:len_P
pn *= P[a, basis[a, n]]
end
p += pn
end
return p
end

@btime mvp(\$P, \$basis, \$coeffs)
``````

Using the `@avx` macro on either the inner or outer loop gives me the following error:

``````MethodError: Cannot `convert` an object of type Expr to an object of type Union{Int64, Symbol}
``````

It has something to do with indexing via `basis[a, n]`. Can anybody explain to me (1) why my naive approach won’t work and (2) whether this code could still be vectorised?

1 Like

It can be vectorized using gather instructions.
These instructions are slow, which is one reason compilers don’t often use them.
I also didn’t add support to `LoopVectorization` for that yet, but that is something I very much want to support. In statistics, it’s common to write code like when you have random effects tied to categorical variables.

So I will work on supporting this with LoopVectorization shortly. I’ll update here (or on a related GitHub issue) once it does.

LoopVectorization will use LLVM’s `gather` intrinsics. Note that only AVX2 and AVX512 provide hardware support/corresponding assembly instructions for gather (as far as x86 goes; GPUs support it).
The code will still work on computers without it, but will almost certainly be slow. Meaning you’d need at least AVX2.

Even with it, gather isn’t fast. But it would be well worth testing for each particular loop, so a convenient way to turn it off and on (by adding and removing `@avx`) would be great.

5 Likes

Thanks for the explanation. I look forward to trying this.