[ANN] LoopVectorization

I may be misreading the code, but it appears the comparison is never made with OpenBLAS. I think it would be of general interest to Julia programmers to see how that compares with the other options. Would it be possible to add it?

I am wondering why my measurements ([ANN] LoopVectorization - #5 by PetrKryslUCSD) are inconsistent with the first figure at the top in the OP.
The AVX-enabled matrix-matrix multiplication only beats OpenBLAS for matrix sizes below 32.
For bigger matrices, OpenBLAS beats the AVX code handily.

Is it just my particular architecture (only AVX), or is there something else going on? In particular, it is unclear whether the intrinsic Fortran matmul is backed by BLAS, MKL, or neither of the two.

Maybe you’re using multithreaded BLAS? Earlier it was mentioned that all the benchmarks were done single-threaded.

You are right, I should have posted my code. However, as you can see, I did set the number of threads.

module mlatest1
using Test
using BenchmarkTools
using LoopVectorization
using LinearAlgebra


function gemmavx!(C, A, B)          # n,k,m loop nest under @avx
    M, N = size(C); K = size(B,1)
    C .= 0
    @avx for n ∈ 1:N, k ∈ 1:K
        Bkn = B[k,n]
        for m ∈ 1:M
            C[m,n] += A[m,k] * Bkn
        end
    end
    return C
end

function gemmsimd!(C, A, B)         # same nest with @inbounds and @simd
    M, N = size(C); K = size(B,1)
    C .= 0
    @inbounds for n ∈ 1:N, k ∈ 1:K
        Bkn = B[k,n]
        @simd for m ∈ 1:M
            C[m,n] += A[m,k] * Bkn
        end
    end
    return C
end

function gemm!(C, A, B)             # plain loops, @inbounds on the inner loop
    M, N = size(C); K = size(B,1)
    C .= 0
    for n ∈ 1:N, k ∈ 1:K
        Bkn = B[k,n]
        @inbounds for m ∈ 1:M
            C[m,n] += A[m,k] * Bkn
        end
    end
    return C
end

function gemmblas!(C, A, B)         # BLAS gemm via LinearAlgebra.mul!
    mul!(C, A, B)
    return C
end

function test(n)
    println("n = ", n)
    C, A, B = rand(n, n), rand(n, n), rand(n, n)
    @test norm(gemm!(C, A, B) - A * B) / norm(C) <= 1.0e-9
    @test norm(gemmsimd!(C, A, B) - A * B) / norm(C) <= 1.0e-9
    @test norm(gemmblas!(C, A, B) - A * B) / norm(C) <= 1.0e-9
    @test norm(gemmavx!(C, A, B) - A * B) / norm(C) <= 1.0e-9

    println("gemm!")
    @btime gemm!($C, $A, $B)
    println("gemmsimd!")
    @btime gemmsimd!($C, $A, $B)
    println("gemmblas!")
    @btime gemmblas!($C, $A, $B)
    println("gemmavx!")
    @btime gemmavx!($C, $A, $B)
end
end

using .mlatest1
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1)
mlatest1.test(2^7)
mlatest1.test(2^6)
mlatest1.test(2^5)
mlatest1.test(2^4)
mlatest1.test(2^3)

Results:

n = 128                                                                                                                                        
gemm!                                                                                                                                          
  769.401 μs (0 allocations: 0 bytes)                                                                                                          
gemmsimd!                                                                                                                                      
  769.400 μs (0 allocations: 0 bytes)                                                                                                          
gemmblas!                                                                                                                                      
  212.400 μs (0 allocations: 0 bytes)                                                                                                          
gemmavx!                                                                                                                                       
  271.500 μs (0 allocations: 0 bytes)                                                                                                          
n = 64                                                                                                                                         
gemm!                                                                                                                                          
  90.600 μs (0 allocations: 0 bytes)                                                                                                           
gemmsimd!                                                                                                                                      
  91.200 μs (0 allocations: 0 bytes)                                                                                                           
gemmblas!                                                                                                                                      
  34.099 μs (0 allocations: 0 bytes)                                                                                                           
gemmavx!                                                                                                                                       
  35.999 μs (0 allocations: 0 bytes)                                                                                                           
n = 32                                                                                                                                         
gemm!                                                                                                                                          
  13.999 μs (0 allocations: 0 bytes)                                                                                                           
gemmsimd!                                                                                                                                      
  14.299 μs (0 allocations: 0 bytes)                                                                                                           
gemmblas!                                                                                                                                      
  5.083 μs (0 allocations: 0 bytes)                                                                                                            
gemmavx!                                                                                                                                       
  4.550 μs (0 allocations: 0 bytes)                                                                                                            
n = 16                                                                                                                                         
gemm!                                                                                                                                          
  2.022 μs (0 allocations: 0 bytes)                                                                                                            
gemmsimd!                                                                                                                                      
  2.022 μs (0 allocations: 0 bytes)                                                                                                            
gemmblas!                                                                                                                                      
  1.120 μs (0 allocations: 0 bytes)                                                                                                            
gemmavx!                                                                                                                                       
  754.703 ns (0 allocations: 0 bytes)                                                                                                          
n = 8                                                                                                                                          
gemm!                                                                                                                                          
  507.325 ns (0 allocations: 0 bytes)                                                                                                          
gemmsimd!                                                                                                                                      
  649.390 ns (0 allocations: 0 bytes)                                                                                                          
gemmblas!                                                                                                                                      
  370.098 ns (0 allocations: 0 bytes)                                                                                                          
gemmavx!                                                                                                                                       
  147.250 ns (0 allocations: 0 bytes)    

Version:

Julia Version 1.4.0-DEV.653                                                                                                                    
Commit f6ca73ad0b (2019-12-22 09:03 UTC)                                                                                                       
Platform Info:                                                                                                                                 
  OS: Windows (x86_64-w64-mingw32)                                                                                                             
      Microsoft Windows [Version 10.0.18363.535]                                                                                               
  CPU: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz (8 logical cores)
                                                                                                                                               
  Memory: 15.896644592285156 GB (7552.83984375 MB free)                                                                                        
  Uptime: 1.034746e6 sec                                                                                                                       
  Load Avg:  0.0  0.0  0.0                                                                                                                     
  WORD_SIZE: 64                                                                                                                                
  LIBM: libopenlibm                                                                                                                            
  LLVM: libLLVM-8.0.1 (ORCJIT, sandybridge)                                                                                                    
Environment:                                                                                                                                   
  JULIA_NUM_THREADS = 4                                                                                                                        

I was trying the following code for dot(x, A, y) (which should be available in Julia 1.4 I think), to see if @avx would improve performance here.

using LinearAlgebra, LoopVectorization

dot3(x, A, y) = dot(x, A * y)
function dot3avx(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
    (axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
    T = typeof(dot3(first(x), first(A), first(y)))
    s = zero(T)
    i₁ = first(eachindex(x))
    x₁ = first(x)
    @avx for j in eachindex(y)
        yj = y[j]
        if !iszero(yj)
            temp = zero(adjoint(A[i₁,j]) * x₁)
            for i in eachindex(x)
                temp += adjoint(A[i,j]) * x[i]
            end
            s += dot(temp, yj)
        end
    end
    return s
end

But it gives an error (just from trying to evaluate the function definition):

"Don't know how to handle expression:\nif !(iszero(yj))\n    #= In[22]:10 =#\n    temp = zero(adjoint(A[i₁, j]) * x₁)\n    #= In[22]:11 =#\n    for i = eachindex(x)\n        #= In[22]:12 =#\n        temp = vmuladd(adjoint(A[i, j]), x[i], temp)\n    end\n    #= In[22]:14 =#\n    s = s + dot(temp, yj)\nend"

Stacktrace:
 [1] push!(::LoopVectorization.LoopSet, ::Expr, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:557
 [2] add_block!(::LoopVectorization.LoopSet, ::Expr, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:214
 [3] add_loop!(::LoopVectorization.LoopSet, ::Expr, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:265
 [4] add_loop! at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/graphs.jl:262 [inlined]
 [5] copyto! at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:6 [inlined]
 [6] LoopVectorization.LoopSet(::Expr) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:46
 [7] @avx(::LineNumberNode, ::Module, ::Any) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:53

Any suggestions?

Is it possible to take advantage of vectorization with a branch in there?

If I remove the branch I get a different error:

dot3(x, A, y) = dot(x, A * y)
function dot3avx(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
    (axes(x)..., axes(y)...) == axes(A) || throw(DimensionMismatch())
    T = typeof(dot3(first(x), first(A), first(y)))
    s = zero(T)
    i₁ = first(eachindex(x))
    x₁ = first(x)
    @avx for j in eachindex(y)
        yj = y[j]
        temp = zero(adjoint(A[i₁,j]) * x₁)
        for i in eachindex(x)
            temp += adjoint(A[i,j]) * x[i]
        end
        s += dot(temp, yj)
    end
    return s
end
DomainError with -1.033113494964142e17:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

Stacktrace:
 [1] throw_complex_domainerror(::Symbol, ::Float64) at ./math.jl:32
 [2] sqrt at ./math.jl:492 [inlined]
 [3] solve_tilesize(::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:172
 [4] solve_tilesize(::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::Int64, ::Int64) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:221
 [5] solve_tilesize(::LoopVectorization.LoopSet, ::Symbol, ::Symbol, ::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Int64,1,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:245
 [6] evaluate_cost_tile(::LoopVectorization.LoopSet, ::Array{Symbol,1}) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:317
 [7] choose_tile at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:387 [inlined]
 [8] choose_order(::LoopVectorization.LoopSet) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/determinestrategy.jl:401
 [9] lower(::LoopVectorization.LoopSet) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/lowering.jl:809
 [10] @avx(::LineNumberNode, ::Module, ::Any) at /home/cossio/.julia/packages/LoopVectorization/J73NG/src/constructors.jl:53

First of all, thanks to everyone who tried it out and filed an issue, commented here, or sent me a message when something didn’t work.
I’ve fixed a lot of bugs, added the broken examples to the test cases, and created new releases.
LoopVectorization’s latest release is currently 0.2.2.

Great, glad to hear it!
I’ll add

@avx @. dsdt = ((1.0 + tanh(k_s * (u - u_s))) / 2.0 - s) / τ_s

as a test case. I haven’t tried it yet, but it should be fixed (the example from issue 9 is).
Also, if you really want to optimize it, it may be worth experimenting with breaking the broadcast up into a few pieces.
A function like tanh is probably going to eat all your registers, causing spills. Evaluating pieces in separate loops may actually be faster.
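
For example, the split might look something like this (an untested sketch, reusing the variable names from the expression above, with dsdt doubling as scratch space):

@avx @. dsdt = tanh(k_s * (u - u_s))            # expensive pass: tanh gets the registers
@avx @. dsdt = ((1.0 + dsdt) / 2.0 - s) / τ_s   # cheap arithmetic pass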

Try this much simpler version:

function dot3avx(x, A, y)
    M, N = size(A)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx for m ∈ 1:M, n ∈ 1:N
        s += x[m] * A[m,n] * y[n]
    end
    s
end

Your version was written to be very generic.
However, @avx will only work with hardware-native types, like Float32 or Float64, so all the careful behavior like adjoint(A[i,j]) won’t do anything.

Furthermore, defining those new temps is likely to get in the way of optimizations. LoopVectorization does not (currently?) eliminate or simplify anything in your expression.
The simpler the expression, the easier it is to model it, and transform it into something else that’s faster. Anything extra you add can only stop it from doing something.

It doesn’t handle branches yet, but it’d be great to handle them eventually.

Currently, if it sees the function zero, it assumes it’s a constant, doesn’t inspect the arguments (it should be type stable anyway, so it shouldn’t be anything that changes!) and hoists it out of the loop. In your example, that would cause an undef var error since j wasn’t defined yet.

LoopVectorization is currently ignorant of adjoint.
Maybe I should add functions like it and identity (adjoint would be treated identically to identity). When it is ignorant of a function, it assumes that function is expensive enough that doing fancy stuff will cause more harm than good. Unrolling + a bunch of times is okay. Unrolling f(x) = exp(x) + sin(x) isn’t a great idea.

Still, it shouldn’t give the second error you got, so I’ll look into it.

All the computers I have at home (where I ran the benchmarks) have avx512, where OpenBLAS performs embarrassingly badly ;).
I did run a few benchmarks, but thought presenting them would misrepresent the performance.
Would probably be a good idea to run them on my laptop.

@baggepinnen
That surprises me. vmap doesn’t do anything special.
You can see what vmap! actually looks like (ignoring the generated function I used to make it variadic; maybe I should’ve just used splats) on a computer with avx2:

julia> LoopVectorization.vmap_quote(2, Float64) # number of arguments, element type
quote
    M = length(dest)
    vdest = vectorizable(dest)
    m = 0
    varg_1 = vectorizable(args[1])
    varg_2 = vectorizable(args[2])
    for _ = 0:M >>> 2 - 1
        vstore!(vdest, f(vload(Val{4}(), varg_1, m), vload(Val{4}(), varg_2, m)), m)
        m += 4
    end
    if m != M
        __mask__ = mask(Val{4}, M & 3)
        vstore!(vdest, f(vload(Val{4}(), varg_1, m, __mask__), vload(Val{4}(), varg_2, m, __mask__)), m, __mask__)
    end
    dest
end

(vmap just allocates the output, and then calls vmap!)
Using Float32 will of course double the vector sizes.

That is, it just calls the function using SVecs as arguments, and hopes code was written generically enough for that to work.

It seems (confirmed via code_native/code_llvm) that map isn't vectorized at all, which explains the performance difference; broadcast, of course, is vectorized.
Seems like an issue with map. Probably unintentional?
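
A quick way to reproduce the comparison (g and h are stand-in names; timings are machine-dependent):

using BenchmarkTools

g(x, y) = x * y + x    # an arbitrary cheap elementwise function
h(a, b) = g.(a, b)     # wrapper so @btime times the fused broadcast

a, b = rand(1000), rand(1000);
@btime map(g, $a, $b);   # scalar loop
@btime h($a, $b);        # SIMD-vectorized loop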

MLIR is definitely exciting.
The new Fortran compiler that will be mainlined into LLVM is planning on using MLIR and a Fortran-specific dialect (FIR).

Why did I work on LoopVectorization? My primary motivation is improving the codegen and performance of ProbabilityModels.jl, which I hope to get into a state worth looking at within the next few months.
One of the things I wanted to be able to do was source to source autodiff for broadcasts and simple loops. This requires that I have some representation of the loops I can work on. If I’m already doing that, it only makes sense to try and make them fast, too.


Looks like a great package!

  • My impression from your post is that @avx only supports native float types, yet it appears to support Int64, Int32, Int16 as well. Is this expected?
  • Is there a possibility of supporting loops which contain a mix of Int and Float types in the future?
  • My understanding is that there’s no hope of accelerating non-native types like Complex{Float64} or user-defined types like those from DoubleFloats anytime soon. Even so, is there a possibility of the @avx macro simply doing nothing in that case, i.e., not failing? As it stands, it’s harder to write a single method that generalizes to multiple types.

Thanks!


There’s no reason it shouldn’t, but I haven’t been testing it, and some parts weren’t written generically. I’ve since fixed it, but without testing I can’t make promises.

It’s also worth pointing out that AVX (without AVX2 or higher) doesn’t support SIMD integer operations. AVX2 added Int32 and Int64 SIMD support, while AVX512BW (everything with AVX512F except Xeon Phi) added Int8 and Int16.

This means that if I wanted to guarantee good performance, I’d have to do a lot more checks. As it is, on a computer with only AVX, it would try to use integer vectors, which would be slow. On a computer with AVX2, it would still try to use vectors of 16 Int16, but this would again be slow, because the hardware doesn’t support it.
The assembly would get ugly when tiling, with a mountain of register spills, etc.

I think some examples should already work.
If you have an example that doesn’t, please file an issue.

They can be “feature requests” as well.

These are harder, but far from hopeless.
I think the trick (as it so often is) would be to abuse multiple dispatch.

The plan would be (same idea for DoubleFloat):

  1. Define stridedpointer methods for ::AbstractArray{Complex{T}} that return something such that vload and vstore do the right thing (load/store two vectors after shuffling them).
  2. Make sure arithmetic functions work correctly for the vector pairs (ie, for complex, one vector for the real numbers, one for the imaginary).
  3. Come up with a way to tell the code generation that it ought to load two registers for each of these pair-structs, i.e., that each load eats two registers.
  4. Also tell it what the actual cost of each operation is on these types; the pattern will be a little different, so it could make poor choices.

“3.” is by far the hardest.

A simple patch fix we can do for now is have the macro check the element types of the arrays (as it already does with a promote_type on the eltypes), and then have an if/else check on whether it’s supported. If not, it’ll just run the loop as written.

A better fix that would let us use type information while optimizing is to implement the loop macro more like broadcasting: have it transform the loops into expression templates, and then use an @generated function to materialize it. That function would have access to the types (like the broadcast), and could thus be much more flexible in terms of what it can handle.
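
As a function-level illustration of that if/else idea (a hypothetical sumprod helper, not LoopVectorization’s actual implementation; note it does no conjugation, so it is a plain sum of products rather than a true dot):

function sumprod(x, y)
    T = promote_type(eltype(x), eltype(y))
    s = zero(T)
    if T <: Union{Float32, Float64}           # hardware-native: vectorized path
        @avx for i ∈ eachindex(x)
            s += x[i] * y[i]
        end
    else                                      # e.g. Complex: run the loop as written
        @inbounds @simd for i ∈ eachindex(x)
            s += x[i] * y[i]
        end
    end
    return s
end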


For dealing with things like Complex or Dual numbers, I think a much easier solution might be to transform from arrays of structs to structs of arrays, i.e., using StructArrays.jl (https://github.com/JuliaArrays/StructArrays.jl).

In my very limited understanding, those structs of arrays would be much easier for this package to deal with.


Very interesting! Thank you!

I have been using Cartesian (to write generic tensor operations) on floats and other types (automatic differentiation). Would you say that LoopVectorization aims at replacing Cartesian? Any comments on how both packages compare?

Here’s a little demo of using StructArrays with LoopVectorization for those interested. This is a demo gemm kernel that does slightly better than BLAS (LinearAlgebra.mul!) for Float64 inputs:

using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test
BLAS.set_num_threads(1); @show BLAS.vendor()

const MatrixFInt64 = Union{Matrix{Float64}, Matrix{Int}}

function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = z
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end

M, K, N = 50, 51, 52
A = randn(M, K); B = randn(K, N)
C1 = Matrix{Float64}(undef, M, N);
C2 = similar(C1);

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $A, $B)
@test C1 ≈ C2

which gives an output

BLAS.vendor() = :openblas64
  5.594 μs (0 allocations: 0 bytes)
  8.741 μs (0 allocations: 0 bytes)
Test Passed

Now, we use StructArrays to extend this to complex matrices:

function mul_add_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64, factor=1)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        ΔCᵢⱼ = z
        for k ∈ 1:size(A,2)
            ΔCᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] += factor * ΔCᵢⱼ
    end
end

const StructMatrixComplexFInt64 = Union{StructArray{ComplexF64,2}, StructArray{Complex{Int},2}}

function mul_avx!(C::StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
    mul_avx!(  C.re, A.re, B.re)       # C.re = A.re * B.re
    mul_add_avx!(C.re, A.im, B.im, -1) # C.re = C.re - A.im * B.im
    mul_avx!(  C.im, A.re, B.im)       # C.im = A.re * B.im
    mul_add_avx!(C.im, A.im, B.re)     # C.im = C.im + A.im * B.re
end

A  = StructArray(randn(ComplexF64, M, K)); 
B  = StructArray(randn(ComplexF64, K, N));
C1 = StructArray(Matrix{ComplexF64}(undef, M, N)); 
C2 = collect(similar(C1));

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $(collect(A)), $(collect(B))) # collect turns the StructArray into a regular Array
@test  collect(C1) ≈ C2

giving

  30.696 μs (0 allocations: 0 bytes)
  27.582 μs (0 allocations: 0 bytes)
Test Passed

so we go from beating OpenBLAS to doing slightly worse when we switch to complex matrices.

It seems that users won’t necessarily be able to use this approach for any given problem, but it works well at least for things like matrix multiply / tensor contractions. On the library side, I think it’s possible that StructArrays could be integrated into the @avx macro in such a way that all this works normally on complex and dual numbers, but that’s more speculative.


Edit: With the new v0.3.2 update, the codegen for the mul_add_avx! function was improved, so the timing for the complex struct matrix went from 30.696 µs to 22.988 µs, now beating OpenBLAS.


Have you tried the 3m method? That should be faster for these sizes.
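
(For context: 3m computes a complex product with three real multiplications instead of four. Writing A = Aᵣ + iAᵢ and B = Bᵣ + iBᵢ, with t1 = AᵣBᵣ, t2 = AᵢBᵢ, and t3 = (Aᵣ + Aᵢ)(Bᵣ + Bᵢ), the result is Cᵣ = t1 - t2 and Cᵢ = t3 - t1 - t2. A rough, untested sketch reusing the mul_avx! kernels and the StructMatrixComplexFInt64 union from the post above, at the cost of a few temporary allocations:)

function mul_avx_3m!(C::StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
    t1 = similar(C.re); t2 = similar(C.re)
    mul_avx!(t1, A.re, B.re)                    # t1 = Ar*Br
    mul_avx!(t2, A.im, B.im)                    # t2 = Ai*Bi
    mul_avx!(C.im, A.re .+ A.im, B.re .+ B.im)  # t3 = (Ar+Ai)*(Br+Bi), stored in C.im
    @. C.im -= t1 + t2                          # Ci = t3 - t1 - t2
    @. C.re = t1 - t2                           # Cr = t1 - t2
    return C
end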

Curious effect when using @avx: This code for the matrix-matrix multiplication has a slight edge without @avx

function mulCAB1!(C, A, B)
	M, N = size(C); K = size(B,1)
	C .= 0
	for n in 1:N, k in 1:K 
		Bkn = B[k,n]
		for m in 1:M
			C[m,n] += A[m,k] * Bkn
		end
	end
	return C
end

over this code

function mulCAB2!(C, A, B)
	M, N = size(C); K = size(B,1)
	z = zero(eltype(C))
	for m ∈ 1:M, n ∈ 1:N 
		Cmn = z
	    for k ∈ 1:K
	        Cmn += A[m,k] * B[k,n]
	    end
	    C[m,n] = Cmn
	end
	return C
end

which is faster when the @avx macro is used (in front of the outermost loop; obviously either both functions have it or neither one does).
The first version is usually recommended as the fastest in CS books.

Of course, without the @avx macro both functions are an order of magnitude slower than when instrumented with the macro. I tested this for matrices of size 64, 32, 16, and 8. BLAS was faster only for size 64.


I’ll try to start working on proper documentation, which will hopefully make @avx and its performance more transparent.
It’s worth noting here however that LoopVectorization uses a LoopSet struct as its internal representation of a set of loops.
This struct contains little more than a(n unordered!) collection of the loops, as well as a collection of operations performed in the loop. These operations contain their dependencies on other operations and on loops.

@avx on loops builds a LoopSet, which it then converts into an expression.
@avx on broadcasts swaps materialize(!) for vmaterialize(!), which is a generated function that creates a LoopSet, and then converts it into an expression.

The LoopSet does not contain information on the original order of the loops, or where the operations were written (order still matters insofar as it defines dependencies between operations).

However, the orders in which the loops can eventually be placed can be restricted. Consider:

    for m ∈ 1:M, n ∈ 1:N 
        Cmn = z
        for k ∈ 1:K
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end

Here, because Cmn is redefined as a constant for every m and n, it is taken as a hint to depend on the m and n loops.
Right now, that forces the m and n loops to be exterior to where Cmn is ultimately placed.
Additionally, Cmn is then summed over k; the k loop thus depends on Cmn, and must be internal with respect to Cmn’s definition.

Thus it only considers two possible orders (exterior to interior): m, n, k and n, m, k.
One of these is hopefully pretty good.

If instead Cmn = C[m,n], it would not infer the restriction that the m and n loops must be external to that operation. It would instead assume that, if it reorders, it will keep rereading the correct value from memory.
Similarly, if we want to reorder the loops ourselves, we have to replace the Cmn = zero(eltype(C)) with a C .= 0 external to all the loops, and switch to using Cmn = C[m,n] instead.

That is a transformation we are capable of, but that I haven’t implemented in LoopVectorization yet.
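
Concretely, the freely reorderable version described there might look like this (a sketch with a hypothetical name; any permutation of the three loops is now legal):

function mulCAB_reordered!(C, A, B)
    M, N = size(C); K = size(B, 1)
    C .= 0                              # initialization hoisted out of the loops
    for k ∈ 1:K, m ∈ 1:M, n ∈ 1:N      # e.g. k outermost, previously impossible
        Cmn = C[m, n]                   # load from C instead of redefining a constant
        Cmn += A[m, k] * B[k, n]
        C[m, n] = Cmn
    end
    return C
end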

All that said, given the unrestricted version of the loop, what does @avx want to do with it?

julia> using LoopVectorization: LoopSet, choose_order
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> nkm_q = :(for n in 1:N, k in 1:K 
                       Bkn = B[k,n]
                       for m in 1:M
                               C[m,n] += A[m,k] * Bkn
                       end
               end);

julia> nkm_ls = LoopSet(nkm_q);

julia> choose_order(nkm_ls)
([:n, :m, :k], :m, 4, 4)

This happens to be one of the two possible orders of the more restricted version.
Note, the integers will change depending on your CPU architecture. The symbols probably won’t.

What this means is that the loop order it chose is (exterior to interior): n, m, k.
The second :m means that SIMD vectors are loaded from the m loop. That is, adjacent elements along the m axis will be loaded into adjacent SIMD lanes.
If a number doesn’t depend on the m loop (like B[k,n]), it will instead be broadcast across a vector, so that each element is equal.
I call this the vectorized loop. The vectorized loop is unrolled by the vector width.

The reason the nkm loop is faster with base Julia/@simd is that it allows the m loop to be vectorized, while the mnk order does not.

The @avx macro can load SIMD vectors along the m axis regardless of the loop order it actually chooses.

The two integers, in this case (U, T) = 4, 4, refer to how the loops are unrolled. If T == -1, that means the outermost loop is unrolled by U (or U * vector width, if it is also the vectorized loop).
Otherwise, the outermost loop is unrolled by T, and the second outermost loop is unrolled by U (or U * vector width).
What this unrolling means in this example is that it will:

  1. Create U*T vectors of Cmn.
  2. It will start looping over k. On each iteration, it will…
  3. Load U vectors from A.
  4. For t = 1,…,T, it will load an element from B and update the U vectors of Cmn that correspond to that t (using the U vectors from A).
  5. Finish looping, and store the Cmn vectors into C.

So on each iteration of k, you perform a total of U+T loads (U vector loads from A, T broadcasts to fill a register with a single value from B). However, you perform a total of U * T multiplications and additions, without any of the multiplications-and-additions depending on any others, so that they can all take advantage of out of order execution.
That is, it should be able to keep the CPU’s fma (fused multiply-add) units nearly 100% busy.
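
To make that concrete, here is a hand-written scalar analogue of steps 1 to 5 with U = 2 and T = 2 (an illustration, not LoopVectorization’s actual output; in the generated code each accumulator is a W-wide SIMD register and the loads from A are vector loads; assumes M and N are even):

function gemm_unrolled22!(C, A, B)
    M, N = size(C); K = size(A, 2)
    for n ∈ 1:2:N, m ∈ 1:2:M
        C11 = C21 = C12 = C22 = 0.0     # 1. U*T = 4 accumulators held in registers
        for k ∈ 1:K                     # 2. loop over k
            A1 = A[m, k]                # 3. U = 2 loads from A
            A2 = A[m+1, k]
            B1 = B[k, n]                # 4. T = 2 loads (broadcasts) from B,
            C11 = muladd(A1, B1, C11)   #    each feeding U independent fmas
            C21 = muladd(A2, B1, C21)
            B2 = B[k, n+1]
            C12 = muladd(A1, B2, C12)
            C22 = muladd(A2, B2, C22)
        end
        C[m, n]     = C11               # 5. store the accumulators into C
        C[m+1, n]   = C21
        C[m, n+1]   = C12
        C[m+1, n+1] = C22
    end
    return C
end

Each k iteration does U + T = 4 loads feeding U*T = 4 independent fmas, and occupies U*T + U + 1 = 7 registers, which is exactly the counting below.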

This requires a total of U*T registers for Cmn, U registers for A, and 1 register for B, or U*T + U + 1. For U = T = 4, that is 21 registers.
Computers with only AVX2 have 16 floating point registers per core, so choose_order should instead return U = 3; T = 4, yielding exactly 16.
It has to require fewer registers than the CPU has; otherwise it would have to do a bunch of extra loads and stores on each k iteration, cratering the compute / load ratio.

So, @PetrKryslUCSD, @avx does almost the same thing with either loop, because it doesn’t pay too much attention to the actual order you wrote them in. The reason mulCAB1! is slower is that C .= 0 is slower than not doing it, and then, within the loops, loading values from C[m,n] is also slightly slower than simply assigning zero.

In both cases, it is able to vectorize the m loop, while @simd is only capable of doing that if the m loop is the most interior loop.

Wow, that’s fantastic!
Getting performance this way is much simpler than trying to get a macro (which doesn’t see types!) to be able to do something fast.

The ratio of performance of your complex to real example was about 5.5.
I just released a new version of LoopVectorization (v"0.3.2") that improved the code generation for your mul_add_avx!, which now gives me a factor of about 4.2.
This is still behind the ideal 4, but much closer!
Mind rerunning your benchmarks?

The problem was that it did not realize it could place the load from C[i,j] in C[i,j] += factor * ΔCᵢⱼ after the loop, because I had naively been placing all loads prior to any interior loops.
This hurt because it then counted that as an additional occupied register (changing the values of U and T).
Adding a check to see if it is allowed to delay operations until after the loop, and then doing so if allowed, largely solved the problem.

OpenBLAS and MKL seem to have ratios of <4. IIRC, they default to 4m, so I’m guessing that’s because of BLAS overhead being amortized.

I wouldn’t say it aims to replace Cartesian, which aims to make it easier to write generic loops.
LoopVectorization aims to make loops you have written faster.

EDIT: I updated the opening post with more benchmarks.


Wow, thanks for the quick update! I installed the update and re-ran my benchmarks, finding negligible change in the Matrix{Float64} benchmarks (5.594 µs -> 5.603 µs), but the StructArray{ComplexF64,2} benchmark went from 30.696 µs to 22.988 µs, so I’m now observing a 4.1x increased cost for the complex matrix gemm.


Another thing that’s worth pointing out about the above gemm kernels is that they absolutely smoke LinearAlgebra.mul! for matrices of integers. The reason is that OpenBLAS doesn’t do integer matrix multiplication, so mul! will fall back on a generic kernel.

Using the same setup code from my post above, I find

A = rand(-100:100, M, K); B = rand(-100:100, K, N)
C1 = Matrix{eltype(A)}(undef, M, N);
C2 = similar(C1);

@btime mul_avx!($C1, $A, $B) 
@btime mul!(    $C2, $A, $B) 

gives

 28.842 μs (0 allocations: 0 bytes)
 65.560 μs (6 allocations: 336 bytes)

and for complex matrices

A  = StructArray(rand(-100:100, M, K) + im * rand(-100:100, M, K)); 
B  = StructArray(rand(-100:100, K, N) + im * rand(-100:100, K, N));
C1 = StructArray(Matrix{eltype(A)}(undef, M, N)); 
C2 = collect(similar(C1));

@btime mul_avx!($C1, $A, $B)                     
@btime mul!(    $C2, $(collect(A)), $(collect(B)))

gives

 116.938 μs (0 allocations: 0 bytes)
 232.860 μs (6 allocations: 336 bytes)

I’m sure that with an algorithmic improvement, the advantage for Complex{Int} would be even greater.


I tried - naively it turns out - to vectorise a piece of code I’ve long wanted to optimise. A MWE is the following:

using BenchmarkTools, LoopVectorization

maxdeg = 20
nbasis = 10_000
dim = 10
basis = rand(1:(maxdeg+1), (dim, nbasis))
coeffs = rand(nbasis)
P = rand(dim, maxdeg+1)

function mvp(P, basis, coeffs::Vector{T}) where {T}
   len_c = length(coeffs)
   len_P = size(P, 1)
   p = zero(T)
   for n = 1:len_c
      pn = coeffs[n]
      for a = 1:len_P
         pn *= P[a, basis[a, n]]
      end
      p += pn
   end
   return p
end

@btime mvp($P, $basis, $coeffs)

Using the @avx macro on either the inner or outer loop gives me the following error:

MethodError: Cannot `convert` an object of type Expr to an object of type Union{Int64, Symbol}

It has something to do with indexing via basis[a, n]. Can anybody explain to me (1) why my naive approach won’t work and (2) whether this code could still be vectorised?


It can be vectorized using gather instructions.
These instructions are slow, which is one reason compilers don’t often use them.
I also didn’t add support to LoopVectorization for that yet, but that is something I very much want to support. In statistics, it’s common to write code like when you have random effects tied to categorical variables.

So I will work on supporting this with LoopVectorization shortly. I’ll update here (or on a related GitHub issue) once it does.

LoopVectorization will use LLVM’s gather intrinsics. Note that only AVX2 and AVX512 provide hardware support/corresponding assembly instructions for gather (as far as x86 goes; GPUs support it).
The code will still work on computers without it, but will almost certainly be slow. Meaning you’d need at least AVX2.

Even with it, gather isn’t fast. But it would be well worth testing for each particular loop, so a convenient way to turn it off and on (by adding and removing @avx) would be great.
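
To see where the gathers come from, it helps to write the access with an explicit linear index: each SIMD lane then wants to load P at a different, data-dependent address, which is exactly what a gather instruction consumes. A sketch of the same mvp with a hypothetical name:

function mvp_linidx(P, basis, coeffs::Vector{T}) where {T}
    len_c = length(coeffs)
    len_P = size(P, 1)
    p = zero(T)
    @inbounds for n = 1:len_c
        pn = coeffs[n]
        for a = 1:len_P
            # lane-varying address: a + (basis[a, n] - 1) * len_P
            pn *= P[a + (basis[a, n] - 1) * len_P]
        end
        p += pn
    end
    return p
end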


Thanks for the explanation. I look forward to trying this.