[ANN] LoopVectorization

Here’s a little demo of using StructArrays with LoopVectorization for those interested. This is a demo gemm kernel that does slightly better than BLAS (LinearAlgebra.mul!) for Float64 inputs:

using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test
BLAS.set_num_threads(1); @show BLAS.vendor()

const MatrixFInt64 = Union{Matrix{Float64}, Matrix{Int}}

function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = z
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end

M, K, N = 50, 51, 52
A = randn(M, K); B = randn(K, N)
C1 = Matrix{Float64}(undef, M, N);
C2 = similar(C1);

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $A, $B)
@test C1 ≈ C2

which gives an output

BLAS.vendor() = :openblas64
  5.594 μs (0 allocations: 0 bytes)
  8.741 μs (0 allocations: 0 bytes)
Test Passed

Now, we use StructArrays to extend this to complex matrices:

function mul_add_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64, factor=1)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        ΔCᵢⱼ = z
        for k ∈ 1:size(A,2)
            ΔCᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] += factor * ΔCᵢⱼ
    end
end

const StructMatrixComplexFInt64 = Union{StructArray{ComplexF64,2}, StructArray{Complex{Int},2}}

function mul_avx!(C:: StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
    mul_avx!(  C.re, A.re, B.re)       # C.re = A.re * B.re
    mul_add_avx!(C.re, A.im, B.im, -1) # C.re = C.re - A.im * B.im
    mul_avx!(  C.im, A.re, B.im)       # C.im = A.re * B.im
    mul_add_avx!(C.im, A.im, B.re)     # C.im = C.im + A.im * B.re
end

A  = StructArray(randn(ComplexF64, M, K)); 
B  = StructArray(randn(ComplexF64, K, N));
C1 = StructArray(Matrix{ComplexF64}(undef, M, N)); 
C2 = collect(similar(C1));

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $(collect(A)), $(collect(B))) # collect turns the StructArray into a regular Array
@test  collect(C1) ≈ C2

giving

  30.696 μs (0 allocations: 0 bytes)
  27.582 μs (0 allocations: 0 bytes)
Test Passed

so we go from beating OpenBlas to doing slightly worse when we switch to complex matrices.

It seems that users won’t necessarily be able to use this approach for any given problem, but it works well at least for things like matrix multiply / tensor contractions. On the library side, I think it’s possible that StructArrays could be integrated into the @avx macro in such a way that all this works normally on complex and dual numbers, but that’s more speculative.


Edit: With the new v0.3.2 update, the codegen for the mul_add_avx function was improved so the timing for the complex struct matrix went from 30.696 µs to 22.988, now beating OpenBLAS.

10 Likes