I may be misreading the code, but it appears the comparison is never made with OpenBLAS. I think it would be of general interest to Julia programmers to see how that compares with the other options. Would it be possible to add it?
I am wondering why my measurements ([ANN] LoopVectorization - #5 by PetrKryslUCSD) are inconsistent with the first figure at the top in the OP.
The AVX-enabled matrix-matrix multiplication only beats OpenBLAS for matrix sizes below 32.
For larger matrices, OpenBLAS beats the AVX code handily.
Is it just my particular architecture (AVX only), or is there something else going on? In particular, it is unclear whether the intrinsic Fortran matmul is backed by OpenBLAS, MKL, or neither of the two.
Maybe you’re using multithreaded BLAS? Earlier it was mentioned that all the benchmarks were done single-threaded.
You are right, I should have posted my code. However, as you can see I did set the number of threads.
module mlatest1

using Test
using BenchmarkTools
using LoopVectorization
using LinearAlgebra

# LoopVectorization's @avx on the loop nest
function gemmavx!(C, A, B)
    M, N = size(C); K = size(B,1)
    C .= 0
    @avx for n ∈ 1:N, k ∈ 1:K
        Bkn = B[k,n]
        for m ∈ 1:M
            C[m,n] += A[m,k] * Bkn
        end
    end
    return C
end

# Base Julia, @inbounds outside and @simd on the inner loop
function gemmsimd!(C, A, B)
    M, N = size(C); K = size(B,1)
    C .= 0
    @inbounds for n ∈ 1:N, k ∈ 1:K
        Bkn = B[k,n]
        @simd for m ∈ 1:M
            C[m,n] += A[m,k] * Bkn
        end
    end
    return C
end

# Plain triple loop
function gemm!(C, A, B)
    M, N = size(C); K = size(B,1)
    C .= 0
    for n ∈ 1:N, k ∈ 1:K
        Bkn = B[k,n]
        @inbounds for m ∈ 1:M
            C[m,n] += A[m,k] * Bkn
        end
    end
    return C
end

# BLAS (OpenBLAS) via LinearAlgebra.mul!
function gemmblas!(C, A, B)
    mul!(C, A, B)
    return C
end

function test(n)
    println("n = ", n)
    C, A, B = rand(n, n), rand(n, n), rand(n, n)
    @test norm(gemm!(C, A, B) - A * B) / norm(C) <= 1.0e-9
    @test norm(gemmsimd!(C, A, B) - A * B) / norm(C) <= 1.0e-9
    @test norm(gemmblas!(C, A, B) - A * B) / norm(C) <= 1.0e-9
    @test norm(gemmavx!(C, A, B) - A * B) / norm(C) <= 1.0e-9
    println("gemm!")
    @btime gemm!($C, $A, $B)
    println("gemmsimd!")
    @btime gemmsimd!($C, $A, $B)
    println("gemmblas!")
    @btime gemmblas!($C, $A, $B)
    println("gemmavx!")
    @btime gemmavx!($C, $A, $B)
end

end
using .mlatest1
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1)
mlatest1.test(2^7)
mlatest1.test(2^6)
mlatest1.test(2^5)
mlatest1.test(2^4)
mlatest1.test(2^3)
Results:
n = 128
gemm!
769.401 μs (0 allocations: 0 bytes)
gemmsimd!
769.400 μs (0 allocations: 0 bytes)
gemmblas!
212.400 μs (0 allocations: 0 bytes)
gemmavx!
271.500 μs (0 allocations: 0 bytes)
n = 64
gemm!
90.600 μs (0 allocations: 0 bytes)
gemmsimd!
91.200 μs (0 allocations: 0 bytes)
gemmblas!
34.099 μs (0 allocations: 0 bytes)
gemmavx!
35.999 μs (0 allocations: 0 bytes)
n = 32
gemm!
13.999 μs (0 allocations: 0 bytes)
gemmsimd!
14.299 μs (0 allocations: 0 bytes)
gemmblas!
5.083 μs (0 allocations: 0 bytes)
gemmavx!
4.550 μs (0 allocations: 0 bytes)
n = 16
gemm!
2.022 μs (0 allocations: 0 bytes)
gemmsimd!
2.022 μs (0 allocations: 0 bytes)
gemmblas!
1.120 μs (0 allocations: 0 bytes)
gemmavx!
754.703 ns (0 allocations: 0 bytes)
n = 8
gemm!
507.325 ns (0 allocations: 0 bytes)
gemmsimd!
649.390 ns (0 allocations: 0 bytes)
gemmblas!
370.098 ns (0 allocations: 0 bytes)
gemmavx!
147.250 ns (0 allocations: 0 bytes)
Version:
Julia Version 1.4.0-DEV.653
Commit f6ca73ad0b (2019-12-22 09:03 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
Microsoft Windows [Version 10.0.18363.535]
CPU: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz
Memory: 15.896644592285156 GB (7552.83984375 MB free)
Uptime: 1.034746e6 sec
Load Avg: 0.0 0.0 0.0
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, sandybridge)
Environment:
JULIA_NUM_THREADS = 4
I was trying the following code for dot(x, A, y) (which should be available in Julia 1.4, I think), to see if @avx would improve performance here.
using LinearAlgebra, LoopVectorization
dot3(x, A, y) = dot(x, A * y)
function dot3avx(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
    (axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
    T = typeof(dot3(first(x), first(A), first(y)))
    s = zero(T)
    i₁ = first(eachindex(x))
    x₁ = first(x)
    @avx for j in eachindex(y)
        yj = y[j]
        if !iszero(yj)
            temp = zero(adjoint(A[i₁,j]) * x₁)
            for i in eachindex(x)
                temp += adjoint(A[i,j]) * x[i]
            end
            s += dot(temp, yj)
        end
    end
    return s
end
But it gives an error (just from trying to evaluate the function definition):
"Don't know how to handle expression:\nif !(iszero(yj))\n #= In[22]:10 =#\n temp = zero(adjoint(A[i₁, j]) * x₁)\n #= In[22]:11 =#\n for i = eachindex(x)\n #= In[22]:12 =#\n temp = vmuladd(adjoint(A[i, j]), x[i], temp)\n end\n #= In[22]:14 =#\n s = s + dot(temp, yj)\nend"
Stacktrace:
[1] push!(::LoopVectorization.LoopSet, ::Expr, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:557
[2] add_block!(::LoopVectorization.LoopSet, ::Expr, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:214
[3] add_loop!(::LoopVectorization.LoopSet, ::Expr, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:265
[4] add_loop! at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:262 [inlined]
[5] copyto! at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:6 [inlined]
[6] LoopVectorization.LoopSet(::Expr) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:46
[7] @avx(::LineNumberNode, ::Module, ::Any) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:53
Any suggestions?
Is it possible to take advantage of vectorization with a branch in there?
If I remove the branch I get a different error:
dot3(x, A, y) = dot(x, A * y)
function dot3avx(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
    (axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
    T = typeof(dot3(first(x), first(A), first(y)))
    s = zero(T)
    i₁ = first(eachindex(x))
    x₁ = first(x)
    @avx for j in eachindex(y)
        yj = y[j]
        temp = zero(adjoint(A[i₁,j]) * x₁)
        for i in eachindex(x)
            temp += adjoint(A[i,j]) * x[i]
        end
        s += dot(temp, yj)
    end
    return s
end
DomainError with -1.033113494964142e17:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
[1] throw_complex_domainerror(::Symbol, ::Float64) at ./math.jl:32
[2] sqrt at ./math.jl:492 [inlined]
[3] solve_tilesize(::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:172
[4] solve_tilesize(::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::Int64, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:221
[5] solve_tilesize(::LoopVectorization.LoopSet, ::Symbol, ::Symbol, ::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:245
[6] evaluate_cost_tile(::LoopVectorization.LoopSet, ::Array{Symbol,1}) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:317
[7] choose_tile at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:387 [inlined]
[8] choose_order(::LoopVectorization.LoopSet) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:401
[9] lower(::LoopVectorization.LoopSet) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/lowering.jl:809
[10] @avx(::LineNumberNode, ::Module, ::Any) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:53
First of all, thanks to everyone who tried it out and filed an issue, commented here, or sent me a message when something didn’t work.
I’ve fixed a lot of bugs, added the broken examples to the test cases, and created new releases.
LoopVectorization’s latest release is currently 0.2.2.
Great, glad to hear it!
I’ll add
@avx @. dsdt = ((1.0 + tanh(k_s * (u - u_s))) / 2.0 - s) / τ_s
as a test case. I haven’t tried it yet, but it should be fixed (the example from issue 9 is).
Also, if you really want to optimize it, it may be worth experimenting with breaking up the broadcast in a few pieces.
A function like tanh
is probably going to eat all your registers, causing spills. Evaluating pieces in separate loops may actually be faster.
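For instance, here is a minimal two-pass sketch of that idea, using a preallocated scratch array (the name tmp is hypothetical, and the other variables are the ones from the expression above):
tmp = similar(dsdt)
@avx @. tmp = (1.0 + tanh(k_s * (u - u_s))) / 2.0   # the register-hungry tanh pass
@avx @. dsdt = (tmp - s) / τ_s                      # the cheap affine pass
Whether the extra memory traffic for tmp pays for the reduced register pressure is something you would have to benchmark.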
Try this much simpler version:
function dot3avx(x, A, y)
    M, N = size(A)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx for m ∈ 1:M, n ∈ 1:N
        s += x[m] * A[m,n] * y[n]
    end
    s
end
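For example, a quick sanity check against the naive dot3 defined above (not from the original post; the sizes are arbitrary):
x = rand(100); A = rand(100, 100); y = rand(100);
dot3avx(x, A, y) ≈ dot3(x, A, y)   # should hold up to floating-point roundoff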
Your version was written to be very generic. However, @avx will only work with hardware-native types, like Float32 or Float64, so all the careful behavior like adjoint(A[i,j]) won't do anything.
Furthermore, defining those new temps is likely to get in the way of optimizations. LoopVectorization does not (currently?) eliminate anything or simplify your expression.
The simpler the expression, the easier it is to model it and transform it into something else that's faster. Anything extra you add can only stop it from doing something.
It doesn't handle branches yet, but it'd be great to handle them eventually.
Currently, if it sees the function zero, it assumes it's a constant, doesn't inspect the arguments (it should be type stable anyway, so it shouldn't be anything that changes!), and hoists it out of the loop. In your example, that would cause an undef var error since j wasn't defined yet.
LoopVectorization is currently ignorant of adjoint. Maybe I should add functions like it and identity (adjoint would be treated identically to identity). When it is ignorant of a function, it assumes that function is expensive enough that doing fancy stuff will cause more harm than good. Unrolling + a bunch of times is okay. Unrolling f(x) = exp(x) + sin(x) isn't a great idea.
Still, it shouldn’t give the second error you got, so I’ll look into it.
All the computers I have at home (where I ran the benchmarks) have avx512, where OpenBLAS performs embarrassingly badly ;).
I did run a few benchmarks, but thought presenting them would misrepresent the performance.
Would probably be a good idea to run them on my laptop.
@baggepinnen
That surprises me. vmap doesn't do anything special. You can see what vmap! actually looks like (ignoring the generated function I used to make it variadic; maybe I should've just used splats) on a computer with avx2:
julia> LoopVectorization.vmap_quote(2, Float64) # number of arguments, element type
quote
    M = length(dest)
    vdest = vectorizable(dest)
    m = 0
    varg_1 = vectorizable(args[1])
    varg_2 = vectorizable(args[2])
    for _ = 0:M >>> 2 - 1
        vstore!(vdest, f(vload(Val{4}(), varg_1, m), vload(Val{4}(), varg_2, m)), m)
        m += 4
    end
    if m != M
        __mask__ = mask(Val{4}, M & 3)
        vstore!(vdest, f(vload(Val{4}(), varg_1, m, __mask__), vload(Val{4}(), varg_2, m, __mask__)), m, __mask__)
    end
    dest
end
(vmap just allocates the output, and then calls vmap!.) Using Float32 will of course double the vector sizes.
That is, it just calls the function using SVecs as arguments, and hopes the code was written generically enough for that to work.
It seems (confirmed via code_native/code_llvm) that map isn't vectorized at all, explaining the performance difference; broadcast, of course, is vectorized.
Seems like an issue with map. Probably unintentional?
MLIR is definitely exciting.
The new Fortran compiler that will be mainlined into LLVM is planning on using MLIR and a Fortran-specific dialect (FIR).
Why did I work on LoopVectorization? My primary motivation is working on improving codegen and performance of ProbabilityModels.jl, which I hope to get in a state worth looking at within the next few months.
One of the things I wanted to be able to do was source to source autodiff for broadcasts and simple loops. This requires that I have some representation of the loops I can work on. If I’m already doing that, it only makes sense to try and make them fast, too.
Looks like a great package!
- My impression from your post is that @avx only supports native float types, yet it appears to support Int64, Int32, and Int16 as well. Is this expected?
- Is there a possibility of supporting loops which contain a mix of Int and Float types in the future?
- My understanding is that there's no hope of accelerating non-native types like Complex{Float64} or user-defined types like those from DoubleFloats anytime soon. Even so, is there a possibility of the @avx macro just not doing anything in that case, i.e. not failing? With the current state, it's harder to write a single method that generalizes to multiple types.
Thanks!
There’s no reason it shouldn’t, but I haven’t been testing it, and some parts weren’t written generically. I’ve since fixed it, but without testing I can’t make promises.
It's also worth pointing out that AVX (without AVX2 or higher) doesn't support SIMD integer operations. AVX2 added Int32 and Int64 SIMD support, while AVX512BW (everything with AVX512F except Xeon Phi) added Int8 and Int16.
This means that if I wanted to guarantee good performance, I'd have to do a lot more checks. As it is, on a computer with only AVX, it would try to use integer vectors, which would be slow. On a computer with AVX2, it would still try to use vectors of 16 Int16, but this would again be slow, because the hardware doesn't support it.
The assembly would get ugly when tiling, with a mountain of register spills, etc.
I think some examples should already work.
If you have an example that doesn’t, please file an issue.
They can be “feature requests” as well.
These are harder, but far from hopeless.
I think the trick (as it so often is) would be to abuse multiple dispatch.
The plan would be (same idea for DoubleFloat):
1. Define stridedpointer methods for ::AbstractArray{Complex{T}} that return something such that vload and vstore do the right thing (load/store two vectors after shuffling them).
2. Make sure arithmetic functions work correctly for the vector pairs (i.e., for complex, one vector for the real numbers, one for the imaginary).
3. Come up with a way to tell the code generation that it ought to load two registers for each of these pair-structs, i.e. find a way to tell code generation that each load eats 2 registers.
4. Also tell it what the actual cost of each operation is on these types; the pattern will be a little different, so it could make poor choices.
“3.” is by far the hardest.
A simple patch fix we can do for now is to have the macro check the element types of the arrays (as it already does with a promote_type on the eltypes), and then have an if/else check on whether it's supported. If not, it'll just run the loop as written.
A better fix that would let us use type information while optimizing is to implement the loop macro more like broadcasting: have it transform the loops into expression templates, and then use an @generated function to materialize it. That function would have access to the types (like the broadcast), and could thus be much more flexible in terms of what it can handle.
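Until something like that lands in the package, a user-side version of the same idea is to dispatch on the element types yourself. A hedged sketch (the names mydot3, mydot3_avx, mydot3_generic, and the NativeEltypes union are hypothetical, not part of LoopVectorization):
using LoopVectorization

# @avx kernel, intended only for hardware-native element types
function mydot3_avx(x, A, y)
    M, N = size(A)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx for m ∈ 1:M, n ∈ 1:N
        s += x[m] * A[m,n] * y[n]
    end
    s
end

# plain fallback: just run the loop as written
function mydot3_generic(x, A, y)
    M, N = size(A)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    for m ∈ 1:M, n ∈ 1:N
        s += x[m] * A[m,n] * y[n]
    end
    s
end

const NativeEltypes = Union{Float32, Float64, Int32, Int64}

mydot3(x::AbstractVector{<:NativeEltypes}, A::AbstractMatrix{<:NativeEltypes}, y::AbstractVector{<:NativeEltypes}) =
    mydot3_avx(x, A, y)
mydot3(x, A, y) = mydot3_generic(x, A, y)   # anything else falls back to the plain loop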
For dealing with things like Complex or Dual numbers, I think a much easier solution might be to transform from arrays of structs to structs of arrays, i.e. using StructArrays.jl (https://github.com/JuliaArrays/StructArrays.jl).
In my very limited understanding, those structs of arrays would be much easier for this package to deal with.
Very interesting! Thank you!
I have been using Cartesian (to write generic tensor operations) on floats and other types (automatic differentiation). Would you say that LoopVectorization aims at replacing Cartesian? Any comments on how both packages compare?
Here's a little demo of using StructArrays with LoopVectorization for those interested. This is a demo gemm kernel that does slightly better than BLAS (LinearAlgebra.mul!) for Float64 inputs:
using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test
BLAS.set_num_threads(1); @show BLAS.vendor()

const MatrixFInt64 = Union{Matrix{Float64}, Matrix{Int}}

function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = z
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end

M, K, N = 50, 51, 52
A = randn(M, K); B = randn(K, N)
C1 = Matrix{Float64}(undef, M, N);
C2 = similar(C1);

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $A, $B)
@test C1 ≈ C2
which gives an output
BLAS.vendor() = :openblas64
5.594 μs (0 allocations: 0 bytes)
8.741 μs (0 allocations: 0 bytes)
Test Passed
Now, we use StructArrays to extend this to complex matrices:
function mul_add_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64, factor=1)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        ΔCᵢⱼ = z
        for k ∈ 1:size(A,2)
            ΔCᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] += factor * ΔCᵢⱼ
    end
end

const StructMatrixComplexFInt64 = Union{StructArray{ComplexF64,2}, StructArray{Complex{Int},2}}

function mul_avx!(C::StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
    mul_avx!(    C.re, A.re, B.re)     # C.re = A.re * B.re
    mul_add_avx!(C.re, A.im, B.im, -1) # C.re = C.re - A.im * B.im
    mul_avx!(    C.im, A.re, B.im)     # C.im = A.re * B.im
    mul_add_avx!(C.im, A.im, B.re)     # C.im = C.im + A.im * B.re
end

A = StructArray(randn(ComplexF64, M, K));
B = StructArray(randn(ComplexF64, K, N));
C1 = StructArray(Matrix{ComplexF64}(undef, M, N));
C2 = collect(similar(C1));

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $(collect(A)), $(collect(B))) # collect turns the StructArray into a regular Array
@test collect(C1) ≈ C2
giving
30.696 μs (0 allocations: 0 bytes)
27.582 μs (0 allocations: 0 bytes)
Test Passed
so we go from beating OpenBLAS to doing slightly worse when we switch to complex matrices.
It seems that users won't necessarily be able to use this approach for any given problem, but it works well at least for things like matrix multiply / tensor contractions. On the library side, I think it's possible that StructArrays could be integrated into the @avx macro in such a way that all this works normally on complex and dual numbers, but that's more speculative.
Edit: With the new v0.3.2 update, the codegen for the mul_add_avx! function was improved, so the timing for the complex struct matrix went from 30.696 µs to 22.988 µs, now beating OpenBLAS.
Have you tried the 3m method? That should be faster for these sizes.
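For reference, the 3M (Karatsuba-style) trick computes the complex product with three real matrix multiplications instead of four, at the cost of extra additions and temporaries. A rough sketch on top of the StructArray layout above (mul_3m! is a hypothetical name; it allocates temporaries and uses plain * for the real gemms, so it only illustrates the algebra, not an optimized kernel):
function mul_3m!(C::StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
    T1 = A.re * B.re                       # 1st real multiplication
    T2 = A.im * B.im                       # 2nd
    T3 = (A.re .+ A.im) * (B.re .+ B.im)   # 3rd
    C.re .= T1 .- T2                       # Re(AB) = Ar*Br - Ai*Bi
    C.im .= T3 .- T1 .- T2                 # Im(AB) = Ar*Bi + Ai*Br
    return C
end
Whether it actually wins depends on how the saved multiplication weighs against the extra additions and memory traffic at these sizes.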
Curious effect when using @avx: this code for the matrix-matrix multiplication has a slight edge without @avx
function mulCAB1!(C, A, B)
    M, N = size(C); K = size(B,1)
    C .= 0
    for n in 1:N, k in 1:K
        Bkn = B[k,n]
        for m in 1:M
            C[m,n] += A[m,k] * Bkn
        end
    end
    return C
end
over this code
function mulCAB2!(C, A, B)
    M, N = size(C); K = size(B,1)
    z = zero(eltype(C))
    for m ∈ 1:M, n ∈ 1:N
        Cmn = z
        for k ∈ 1:K
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return C
end
which is faster when the @avx macro is used (in front of the outermost loop; obviously either both functions have it or neither one does).
The first version is usually recommended as the fastest in CS books.
Of course, without the @avx macro both functions are an order of magnitude slower than when instrumented with the macro. I tested this for matrices of size 64, 32, 16, and 8. BLAS was faster only for size 64.
I'll try to start working on proper documentation, which will hopefully make @avx and its performance more transparent.
It's worth noting here, however, that LoopVectorization uses a LoopSet struct as its internal representation of a set of loops. This struct contains little more than a(n unordered!) collection of the loops, as well as a collection of operations performed in the loop. These operations contain their dependencies on other operations and on loops.
@avx on loops builds a LoopSet, which it then converts into an expression. @avx on broadcasts swaps materialize(!) for vmaterialize(!), which is a generated function that creates a LoopSet and then converts it into an expression.
The LoopSet does not contain information on the original order of the loops, or on where the operations were written (order still matters insofar as it defines dependencies between operations). However, the orders the loops can eventually be placed in can be restricted. Writing:
for m ∈ 1:M, n ∈ 1:N
    Cmn = z
    for k ∈ 1:K
        Cmn += A[m,k] * B[k,n]
    end
    C[m,n] = Cmn
end
Here, because Cmn is redefined as a constant for every m and n, that is taken as a hint that it depends on the m and n loops. Right now, that forces the m and n loops to be exterior to where Cmn is ultimately placed.
Additionally, Cmn is then summed over k. That forces the k loop to depend on Cmn, so it must be internal with respect to Cmn's definition.
Thus it only considers two possible orders (exterior to interior): m, n, k and n, m, k. One of these is hopefully pretty good.
If instead Cmn = C[m,n], it would not infer the restriction that the m and n loops must be external to that operation. It will instead assume that if it reorders, it will keep rereading the correct value from memory.
Similarly, if we want to reorder the loops ourselves, we have to replace the Cmn = zero(eltype(C)) with a C .= 0 external to all the loops, and switch to using Cmn = C[m,n] instead.
That is a transformation we are capable of, but that I haven't implemented in LoopVectorization yet.
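If you are curious, you can also inspect this restricted (mnk) version the same way as the demo below. This is only a sketch; the unroll factors in the result depend on your CPU, so I won't assert the output:
using LoopVectorization: LoopSet, choose_order

mnk_q = :(for m ∈ 1:M, n ∈ 1:N
        Cmn = z
        for k ∈ 1:K
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end);

choose_order(LoopSet(mnk_q))  # should pick one of the two orders discussed above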
All that said, given the unrestricted version of the loop, what does @avx want to do with it?
julia> using LoopVectorization: LoopSet, choose_order
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]
julia> nkm_q = :(for n in 1:N, k in 1:K
           Bkn = B[k,n]
           for m in 1:M
               C[m,n] += A[m,k] * Bkn
           end
       end);
julia> nkm_ls = LoopSet(nkm_q);
julia> choose_order(nkm_ls)
([:n, :m, :k], :m, 4, 4)
This happens to be one of the two possible orders of the more restricted version.
Note: the integers will change depending on your CPU architecture. The symbols probably won't.
What this means is that the loop order it chose is (exterior to interior): n, m, k.
The second :m means that SIMD vectors are loaded from the m loop. That is, adjacent elements along the m axis will be loaded into adjacent SIMD lanes. If a number doesn't depend on the m loop (like B[k,n]), it will instead be broadcast across a vector, so each element is equal. I call this the vectorized loop. The vectorized loop is unrolled by the vector width.
The reason the nkm loop is faster with base Julia/@simd is that it allows the m loop to be vectorized, while the mnk loop cannot be. The @avx macro can load SIMD vectors along the m axis regardless of the loop order it actually chooses.
The two integers, in this case (U, T) = (4, 4), refer to how the loops are unrolled. If T == -1, that means the outermost loop is unrolled by U (or by U * vector width, if it is also the vectorized loop). Otherwise, the outermost loop is unrolled by T, and the second outermost loop is unrolled by U (or U * vector width).
What this unrolling means in this example is that it will:
1. Create U*T vectors of Cmn.
2. Start looping over k. On each iteration, it will…
3. Load U vectors from A.
4. For t = 1,…,T, load an element from B, and update the U vectors of Cmn that correspond to that t (using the U vectors from A).
5. Finish looping, and store the Cmn vectors into C.
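To make that schedule concrete, here is a small, runnable sketch in plain Julia, with ordinary array slices standing in for SIMD registers. It illustrates the pattern described in the list above for U = 2, T = 2 and a pretend vector width of W = 4; it is not the code @avx actually generates:
M, K, N = 8, 5, 2
A = rand(M, K); B = rand(K, N); C = zeros(M, N)
W, U, T = 4, 2, 2
m, n = 1, 1                                # top-left corner of the block being computed
acc = [zeros(W) for u in 1:U, t in 1:T]    # the U*T accumulator "registers"
for k in 1:K
    a = [A[m + (u-1)*W : m + u*W - 1, k] for u in 1:U]   # U vector loads from A
    for t in 1:T
        b = B[k, n + t - 1]                               # one broadcast load from B
        for u in 1:U
            acc[u, t] .+= a[u] .* b                       # U*T multiply-adds per k iteration
        end
    end
end
for t in 1:T, u in 1:U                                    # store the accumulators into C
    C[m + (u-1)*W : m + u*W - 1, n + t - 1] .= acc[u, t]
end
C ≈ A * B   # true here, since this single block happens to cover the whole matrix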
So on each iteration of k, you perform a total of U + T loads (U vector loads from A, and T broadcasts to fill a register with a single value from B). However, you perform a total of U * T multiplications and additions, with none of the multiply-adds depending on any of the others, so they can all take advantage of out-of-order execution.
That is, it should be able to keep the CPU's fma (fused multiply-add) units nearly 100% busy.
This requires a total of U*T registers for Cmn, U registers for A, and 1 register for B, or U*T + U + 1. For U = T = 4, that is 21 registers.
Computers with only AVX2 have 16 floating point registers per core, so choose_order should instead return U = 3, T = 4, yielding exactly 16.
It has to require fewer registers than the CPU has; otherwise it would have to do a bunch of extra loads and stores on each k iteration, cratering the compute / load ratio.
So, @PetrKryslUCSD, @avx does almost the same thing with either loop, because it doesn't pay too much attention to the actual order you wrote them in. The reason mulCAB1! is slower is that C .= 0 is slower than not doing that, and then within the loops, loading values from C[m,n] is also slightly slower than simply assigning 0.
In both cases, it is able to vectorize the m loop, while @simd is only capable of doing that if the m loop is the innermost loop.
Wow, that’s fantastic!
Performance-wise, it is much simpler than trying to get a macro (which doesn’t see types!) to be able to do something fast.
The ratio of performance of your complex to real example was about 5.5.
I just released a new version of LoopVectorization (v"0.3.2") that improved the code generation for your mul_add_avx!, which now gives me a factor of about 4.2. This is still behind the ideal 4, but much closer!
Mind rerunning your benchmarks?
The problem was that it did not realize it could place the load from C[i,j] in C[i,j] += factor * ΔCᵢⱼ after the loop, because I had naively been placing all loads prior to any interior loops. This hurt because it then counted that as an additional occupied register (changing the values of U and T).
Adding a check to see if it is allowed to delay operations until after the loop, and then doing so when allowed, largely solved the problem.
OpenBLAS and MKL seem to have ratios of <4. IIRC, they default to 4m, so I’m guessing that’s because of BLAS overhead being amortized.
I wouldn’t say it aims to replace Cartesian, which aims to make it easier to write generic loops.
LoopVectorization aims to make loops you have written faster.
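To make the distinction concrete: Base.Cartesian helps you write (dimension-generic) loops in the first place, while LoopVectorization takes loops you have already written and tries to make them fast. A tiny sketch of the former, using standard Base.Cartesian macros and nothing from LoopVectorization:
using Base.Cartesian

# @nloops 2 i A expands into nested loops over axes(A, 2) and axes(A, 1);
# @nref 2 A i expands into A[i_1, i_2].
function sum2d(A::AbstractMatrix)
    s = zero(eltype(A))
    @nloops 2 i A begin
        s += @nref 2 A i
    end
    return s
end

A = rand(4, 4)
sum2d(A) ≈ sum(A)   # true
They solve different problems, so they are complementary rather than competing.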
EDIT: I updated the opening post with more benchmarks.
Wow, thanks for the quick update! I installed the update and re-ran my benchmarks, finding negligible change in the Matrix{Float64} benchmarks (5.594 -> 5.603 µs), but the StructArray{ComplexF64,2} benchmark went from 30.696 µs to 22.988 µs, so I'm observing a 4.1x increased cost for the complex matrix gemm.
Another thing that's worth pointing out about the above gemm kernels is that they absolutely smoke LinearAlgebra.mul! for matrices of integers. The reason is that OpenBLAS doesn't do integer matrix multiplication, so mul! will fall back on a generic kernel.
Using the same setup code from my post above, I find
A = rand(-100:100, M, K); B = rand(-100:100, K, N)
C1 = Matrix{eltype(A)}(undef, M, N);
C2 = similar(C1);
@btime mul_avx!($C1, $A, $B)
@btime mul!( $C2, $A, $B)
gives
28.842 μs (0 allocations: 0 bytes)
65.560 μs (6 allocations: 336 bytes)
and for complex matrices
A = StructArray(rand(-100:100, M, K) + im * rand(-100:100, M, K));
B = StructArray(rand(-100:100, K, N) + im * rand(-100:100, K, N));
C1 = StructArray(Matrix{eltype(A)}(undef, M, N));
C2 = collect(similar(C1));
@btime mul_avx!($C1, $A, $B)
@btime mul!( $C2, $(collect(A)), $(collect(B)))
gives
116.938 μs (0 allocations: 0 bytes)
232.860 μs (6 allocations: 336 bytes)
I'm sure that with an algorithmic improvement, the advantage for Complex{Int} would be even greater.
I tried - naively it turns out - to vectorise a piece of code I’ve long wanted to optimise. A MWE is the following:
using BenchmarkTools, LoopVectorization
maxdeg = 20
nbasis = 10_000
dim = 10
basis = rand(1:(maxdeg+1), (dim, nbasis))
coeffs = rand(nbasis)
P = rand(dim, maxdeg+1)
function mvp(P, basis, coeffs::Vector{T}) where {T}
    len_c = length(coeffs)
    len_P = size(P, 1)
    p = zero(T)
    for n = 1:len_c
        pn = coeffs[n]
        for a = 1:len_P
            pn *= P[a, basis[a, n]]
        end
        p += pn
    end
    return p
end
@btime mvp($P, $basis, $coeffs)
Using the @avx macro on either the inner or outer loop gives me the following error:
MethodError: Cannot `convert` an object of type Expr to an object of type Union{Int64, Symbol}
It has something to do with indexing via basis[a, n]. Can anybody explain to me (1) why my naive approach won't work and (2) whether this code could still be vectorised?
It can be vectorized using gather instructions. These instructions are slow, which is one reason compilers don't often use them.
I also didn't add support to LoopVectorization for that yet, but that is something I very much want to support. In statistics, it's common to write code like that when you have random effects tied to categorical variables. So I will work on supporting this with LoopVectorization shortly. I'll update here (or on a related GitHub issue) once it does.
LoopVectorization will use LLVM's gather intrinsics. Note that only AVX2 and AVX512 provide hardware support/corresponding assembly instructions for gather (as far as x86 goes; GPUs support it). The code will still work on computers without it, but will almost certainly be slow; meaning you'd need at least AVX2.
Even with it, gather isn't fast. But it would be well worth testing for each particular loop, so a convenient way to turn it on and off (by adding and removing @avx) would be great.
Thanks for the explanation. I look forward to trying this.