[ANN] LoopVectorization

Mason · January 8, 2020, 1:19am

Here’s a little demo of using StructArrays with LoopVectorization for those interested. This is a demo gemm kernel that does slightly better than BLAS (LinearAlgebra.mul!) for Float64 inputs:

using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test
BLAS.set_num_threads(1); @show BLAS.vendor()

const MatrixFInt64 = Union{Matrix{Float64}, Matrix{Int}}

function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = z
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end

M, K, N = 50, 51, 52
A = randn(M, K); B = randn(K, N)
C1 = Matrix{Float64}(undef, M, N);
C2 = similar(C1);

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $A, $B)
@test C1 ≈ C2

which gives an output

BLAS.vendor() = :openblas64
  5.594 μs (0 allocations: 0 bytes)
  8.741 μs (0 allocations: 0 bytes)
Test Passed

Now, we use StructArrays to extend this to complex matrices:

function mul_add_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64, factor=1)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        ΔCᵢⱼ = z
        for k ∈ 1:size(A,2)
            ΔCᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] += factor * ΔCᵢⱼ
    end
end

const StructMatrixComplexFInt64 = Union{StructArray{ComplexF64,2}, StructArray{Complex{Int},2}}

function mul_avx!(C:: StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
    mul_avx!(  C.re, A.re, B.re)       # C.re = A.re * B.re
    mul_add_avx!(C.re, A.im, B.im, -1) # C.re = C.re - A.im * B.im
    mul_avx!(  C.im, A.re, B.im)       # C.im = A.re * B.im
    mul_add_avx!(C.im, A.im, B.re)     # C.im = C.im + A.im * B.re
end

A  = StructArray(randn(ComplexF64, M, K)); 
B  = StructArray(randn(ComplexF64, K, N));
C1 = StructArray(Matrix{ComplexF64}(undef, M, N)); 
C2 = collect(similar(C1));

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $(collect(A)), $(collect(B))) # collect turns the StructArray into a regular Array
@test  collect(C1) ≈ C2

giving

  30.696 μs (0 allocations: 0 bytes)
  27.582 μs (0 allocations: 0 bytes)
Test Passed

so we go from beating OpenBlas to doing slightly worse when we switch to complex matrices.

It seems that users won’t necessarily be able to use this approach for any given problem, but it works well at least for things like matrix multiply / tensor contractions. On the library side, I think it’s possible that StructArrays could be integrated into the @avx macro in such a way that all this works normally on complex and dual numbers, but that’s more speculative.

Edit: With the new v0.3.2 update, the codegen for the mul_add_avx function was improved so the timing for the complex struct matrix went from 30.696 µs to 22.988, now beating OpenBLAS.

Topic		Replies	Views
ANN: LoopVectorization 0.12: multithreading and better handling of discontiguous memory accesses Performance	16	2169	March 17, 2021
Help Improving Performance of a Loop Performance performance , loops	15	1090	February 16, 2021
Simple Mat-Vec multiply (understanding performance, without the bugs) Performance tullio	16	3296	August 12, 2020
Julia matrix-multiplication performance Performance linearalgebra	20	8666	October 30, 2022
Sum operations between arrays Performance	21	5625	April 7, 2020

[ANN] LoopVectorization

Related topics