[ANN] LoopVectorization

Okay, so I’ve made some updates to Gaius.jl.

  • Gaius.jl now supports multiplication of complex matrices, so long as they are StructArrays from StructArrays.jl (see the sketch just after this list). It’s possible that we could one day support Arrays of Structs, but that would need some big changes to LoopVectorization.jl, and StructArrays are the path of least resistance.

  • I changed the way the sub-matrix blocking works to make more sense. Before, I did some weird things with how the matrices were divided into blocks, but now I do what one would expect: take in a cutoff size, and once a matrix is bigger than that size, extract sub-matrices of that size (before, I would take out sub-matrices that were half that size. Don’t ask why, because I don’t know).

  • It turns out that there’s no single sub-matrix size that works well for all input matrix sizes, so the default now adaptively chooses between 32 x 32 and 48 x 48 sub-matrices, depending on whether the input matrix is larger or smaller than 72 x 72. Making this change results in much better comparisons against OpenBLAS on my desktop machine, but I see that on my crappy laptop it actually makes Gaius.jl perform worse, so I don’t know how universal these choices are.
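For the first bullet, here is a minimal usage sketch. It assumes the blocked_mul! entry point that also appears in the benchmark script later in this thread; the sizes are arbitrary.

using Gaius, StructArrays
A = StructArray{ComplexF64}((rand(64, 64), rand(64, 64)))   # real and imaginary parts stored as separate arrays
B = StructArray{ComplexF64}((rand(64, 64), rand(64, 64)))
C = StructArray{ComplexF64}((zeros(64, 64), zeros(64, 64)))
blocked_mul!(C, A, B)          # writes A * B into C
C ≈ collect(A) * collect(B)    # check against the dense complex product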

(Benchmark plots: F64_mul, C64_mul, F32_mul.)

@Elrod I’m currently not using your const DEFAULT_BLOCK_SIZE = REGISTER_SIZE == 64 ? 128 : 104, but I think we can work that back in if this change tanked performance on AVX512.

8 Likes

Well, I suggested these kernels because currently I think the bar isn’t too high in terms of what BLAS or SuiteSparse offer as an alternative. I basically always do this operation in 2 steps as in A' * (B * C). So if you can beat that even if it’s far from the best implementation of this kernel, I would love to use your package immediately! The 3-arg dot method implementation in Julia supports a sparse middle matrix. Maybe I can try playing with this at some point if you don’t beat me to it :slight_smile:
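For reference, a small sketch of the two approaches being compared, using a random sparse middle matrix purely as an illustration (the sizes and density are made up):

using LinearAlgebra, SparseArrays
x = rand(1_000); y = rand(1_000)
B = sprand(1_000, 1_000, 0.01)   # sparse middle matrix
dot(x, B, y)     # 3-arg dot: a single pass, no temporary vector
x' * (B * y)     # two-step version: allocates the intermediate B * y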

2 Likes

I think the key word is “sparse”. It is hard to support fast multiplications of sparse matrices with dense matrices or vectors.

1 Like

I tried my hand at single threaded matmul in PaddedMatrices.
Benchmarking against MKL (through ccall), OpenBLAS, and Gaius (all single threaded):
Float64, natural:


Float64, log-log:

Float32, natural:

Float32, log-log:

Versus Gaius and generic matmul, 64 bit integer, natural scale:


64 bit integer, log-log:

32 bit integers, natural:

32 bit integer, log-log:

This was with the latest Julia master.

Benchmark script
using Gaius, StructArrays, LinearAlgebra, BenchmarkTools
using PaddedMatrices: jmul!
BLAS.set_num_threads(1); Base.Threads.nthreads()

randa(::Type{T}, dim...) where {T} = rand(T, dim...)
randa(::Type{T}, dim...) where {T <: Signed} = rand(T(-100):T(200), dim...)

const LIBDIRECTCALLJIT = "/home/chriselrod/.julia/dev/LoopVectorization/benchmark/libdcjtest.so"
istransposed(x) = false
istransposed(x::Adjoint) = true
istransposed(x::Transpose) = true
mkl_set_num_threads(N::Integer) = ccall((:set_num_threads, LIBDIRECTCALLJIT), Cvoid, (Ref{UInt32},), Ref(N % UInt32))
function mklmul!(C::AbstractVecOrMat{Float32}, A::AbstractVecOrMat{Float32}, B::AbstractVecOrMat{Float32})
    M, N = size(C); K = size(B, 1)
    ccall(
        (:sgemmjit, LIBDIRECTCALLJIT), Cvoid,
        (Ptr{Float32},Ptr{Float32},Ptr{Float32},Ref{UInt32},Ref{UInt32},Ref{UInt32},Ref{Bool},Ref{Bool}),
        parent(C), parent(A), parent(B),
        Ref(M % UInt32), Ref(K % UInt32), Ref(N % UInt32),
        Ref(istransposed(A)), Ref(istransposed(B))
    )
end
function mklmul!(C::AbstractVecOrMat{Float64}, A::AbstractVecOrMat{Float64}, B::AbstractVecOrMat{Float64})
    M, N = size(C); K = size(B, 1)
    ccall(
        (:dgemmjit, LIBDIRECTCALLJIT), Cvoid,
        (Ptr{Float64},Ptr{Float64},Ptr{Float64},Ref{UInt32},Ref{UInt32},Ref{UInt32},Ref{Bool},Ref{Bool}),
        parent(C), parent(A), parent(B),
        Ref(M % UInt32), Ref(K % UInt32), Ref(N % UInt32),
        Ref(istransposed(A)), Ref(istransposed(B))
    )
end

mkl_set_num_threads(1)

function runbench(::Type{T}) where {T}
    (StructVector ∘ map)([2, 4, 8:8:128..., round.(Int, (10:65) .^2.2)...]) do sz
        n, k, m = sz, sz, sz
        C1 = zeros(T, n, m)
        C2 = zeros(T, n, m)
        C3 = zeros(T, n, m)
        C4 = zeros(T, n, m)
        A  = randa(T, n, k)
        B  = randa(T, k, m)

        opb = @elapsed mul!(C1, A, B)
        if 2opb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
            opb = min(opb, @belapsed mul!($C1, $A, $B))         #samples=100
        end
        lvb = @elapsed blocked_mul!(C2, A, B)
        if 2lvb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
            lvb = min(lvb, @belapsed blocked_mul!($C2, $A, $B)) #samples=100
        end
        @assert C1 ≈ C2
        pmb = @elapsed jmul!(C3, A, B)
        if 2pmb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
            pmb = min(pmb, @belapsed jmul!($C3, $A, $B))         #samples=100
        end
        @assert C1 ≈ C3
        if T <: Integer
            @show (matrix_size=sz, lvBLAS=lvb, OpenBLAS=opb, PaddedMatrices = pmb)
        else
            mklb = @elapsed mklmul!(C4, A, B)
            if 2mklb < BenchmarkTools.DEFAULT_PARAMETERS.seconds
                mklb = min(mklb, @belapsed mklmul!($C4, $A, $B))         #samples=100
            end
            @assert C1 ≈ C4
            @show (matrix_size=sz, lvBLAS=lvb, OpenBLAS=opb, PaddedMatrices = pmb, MKL = mklb)
        end
    end
end
tf64 = runbench(Float64);
tf32 = runbench(Float32);
ti64 = runbench(Int64);
ti32 = runbench(Int32);

using VegaLite, DataFrames

function plotbench(res)
    df = DataFrame(
        Size = res.matrix_size,
        Gaius = res.lvBLAS,
        PaddedMatrices = res.PaddedMatrices,
        OpenBLAS = res.OpenBLAS,
        MKL = res.MKL
    )
    dfs = stack(df, [:Gaius, :PaddedMatrices, :OpenBLAS, :MKL])
    rename!(dfs, :value => :Time) |> @vlplot(
        :line, color = :variable,
        x = {:Size, scale={type=:log}}, y = {:Time, scale={type=:log}},
        width = 900, height = 600
    )
end
function plotbenchgeneric(res)
    df = DataFrame(
        Size = res.matrix_size,
        Gaius = res.lvBLAS,
        PaddedMatrices = res.PaddedMatrices,
        GenericMatmul = res.OpenBLAS
    )
    dfs = stack(df, [:Gaius, :PaddedMatrices, :GenericMatmul])
    rename!(dfs, :value => :Time) |> @vlplot(
        :line, color = :variable,
        x = {:Size, scale={type=:log}}, y = {:Time, scale={type=:log}},
        width = 900, height = 600
    )
end

const PICTURES = "/home/chriselrod/Pictures"

save(joinpath(PICTURES, "gemmf64.png"), plotbench(tf64))
save(joinpath(PICTURES, "gemmf32.png"), plotbench(tf32))


save(joinpath(PICTURES, "gemmi64.png"), plotbenchgeneric(ti64))
save(joinpath(PICTURES, "gemmi32.png"), plotbenchgeneric(ti32))

MKL wrapper for ccall (I could probably have just ccalled it directly, but I tried to get the MKL JIT to work without success)

module jitmul

use ISO_C_BINDING
use mkl_service
implicit none

include "/opt/intel/mkl/include/mkl_direct_call.fi"
! include "/opt/intel/mkl/include/mkl_service.fi"
! include "/opt/intel/mkl/include/mkl.fi"

contains
  subroutine set_num_threads(N) bind(C, name = "set_num_threads")
    integer(C_int32_t) :: N
    call mkl_set_num_threads(N)
  end subroutine set_num_threads
  
    ! subroutine dgemmjit(C,A,B,M,K,N,alpha,beta) bind(C, name = "dgemmjit")
  subroutine sgemmjit(C,A,B,M,K,N,At,Bt) bind(C, name = "sgemmjit")
    integer(C_int32_t),  intent(in)  :: M, K, N
    integer(C_int8_t),   intent(in) :: At, Bt
    real(C_float), parameter :: alpha = 1.0e0, beta = 0.0e0
    ! real(C_float),                 intent(in)  :: alpha, beta
    real(C_float), dimension(M,K), intent(in)  :: A
    real(C_float), dimension(K,N), intent(in)  :: B
    real(C_float), dimension(M,N), intent(out) :: C
    character :: Atc, Btc
    ! call mkl_set_threading_layer(MKL_THREADING_SEQUENTIAL)
    if (At == 1_C_int8_t) then
       Atc = 'T'
    else
       Atc = 'N'
    end if
    if (Bt == 1_C_int8_t) then
       Btc = 'T'
    else
       Btc = 'N'
    end if
    call sgemm(Atc, Btc, M, N, K, alpha, A, M, B, K, beta, C, M)
  end subroutine sgemmjit
  subroutine dgemmjit(C,A,B,M,K,N,At,Bt) bind(C, name = "dgemmjit")
    integer(C_int32_t),  intent(in)  :: M, K, N
    integer(C_int8_t),   intent(in) :: At, Bt
    real(C_double), parameter :: alpha = 1.0d0, beta = 0.0d0
    ! real(C_double),                 intent(in)  :: alpha, beta
    real(C_double), dimension(M,K), intent(in)  :: A
    real(C_double), dimension(K,N), intent(in)  :: B
    real(C_double), dimension(M,N), intent(out) :: C
    character :: Atc, Btc
    ! call mkl_set_threading_layer(MKL_THREADING_SEQUENTIAL)
    if (At == 1_C_int8_t) then
       Atc = 'T'
    else
       Atc = 'N'
    end if
    if (Bt == 1_C_int8_t) then
       Btc = 'T'
    else
       Btc = 'N'
    end if
    call dgemm(Atc, Btc, M, N, K, alpha, A, M, B, K, beta, C, M)
  end subroutine dgemmjit

end module jitmul

EDIT:
I just checked out this PR, which bumps the OpenBLAS version to 0.3.9. I haven’t run many tests yet, but OpenBLAS seems faster than MKL on this CPU.

13 Likes

@Elrod, As always, really interesting work.

I understand MKL was called in Direct Mode with JIT acceleration?

Probably not. I tried, but I think at least the JIT did not work.

EDIT:
@tkf I don’t want to hijack your ThreadsX.jl thread to talk about LoopVectorization.jl, so I figured it’d be better to move that part of the discussion here (unless you think somewhere else would be more appropriate).

I am assuming that the workload per element is more-or-less constant, but perhaps I shouldn’t assume the outermost axis is much longer than Threads.nthreads(). At least one of the loops must be much longer if threading’s overhead is going to be amortized. It’d be unfortunate if, e.g., there were two loops that were 3 and 3,000,000 iterations respectively, and LoopVectorization picked the first to partition and thread on a many-threaded CPU.
Maybe they could pass these as compile-time constants, in which case I could have it check to make sure loops are at least a certain length relative to Sys.CPU_THREADS or VectorizationBase.NUM_CORES. FWIW, it currently treats dynamically sized loops as being of length 1024 when making optimization decisions.
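A hypothetical sketch of that length check (the function name and the safety factor of 8 are made up for illustration):

using Base.Threads
function threadable_loops(loop_lengths::Tuple)
    threshold = 8 * nthreads()                      # assumed safety margin over the thread count
    findall(>=(threshold), collect(loop_lengths))   # indices of loops long enough to partition
end

threadable_loops((3, 3_000_000))   # returns [2]: only the second loop is worth threading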

The reason I’d generally prefer to pick a relatively outer loop is to reduce overhead.
As one example, when I get around to parallelizing matrix multiplication, it will not happen within LoopVectorization (or at least, not until it models cache/memory bandwidth). Instead, PaddedMatrices.jmul! currently wraps 3 additional loops around the @avx loop to add extra tiling and pack blocks from A and B.

Maybe I should read the literature to find out how others add threading, but as a guess, perhaps @spawn on both the outermost loop and the second-from-outermost loop (the M and N loops)?
I’m inclined to try that, and look at literature if it’s still slower by the time I run out of ideas.

Is the idea of halve to use a recursive “cache-oblivious” approach, until the length of the iterable is below some threshold?

Probably worth considering different ways of iterating. Currently, LoopVectorization assumes you’re iterating over an AbstractUnitRange or CartesianIndices{N}, and it generates corresponding nested while loops (N of them, for the CartesianIndices).
The natural thing, if continuing that approach, would be to split the threaded loop into two nested loops (outer of nchunk, inner over chunksize), and @spawn the inner loop.
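A rough illustration of that lowering, not LoopVectorization’s actual code generation (chunked_foreach is a made-up helper):

using Base.Threads
function chunked_foreach(f, N; nchunks = nthreads())
    chunksize = cld(N, nchunks)
    tasks = map(1:nchunks) do c                 # outer loop over chunks
        lo = (c - 1) * chunksize + 1
        hi = min(c * chunksize, N)
        @spawn for i in lo:hi                   # the chunk-local inner loop runs in its own task
            f(i)
        end
    end
    foreach(wait, tasks)
end

chunked_foreach(i -> nothing, 10^6)   # trivial usage; real work would go in f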

LoopVectorization currently doesn’t support non-rectangular iteration, and it also doesn’t support loop-carried dependencies; it currently assumes that it can arbitrarily rearrange the order of evaluations.
However, I would like it to eventually be able to generate kernels for things like solving triangular systems of equations, which would require lifting both of those restrictions.
Really figuring out how to do this will take more time than I’ll be able to dedicate for a while.

But that’s something to at least start holding in mind now.
Handling loop carried dependencies would mean that I can’t just naively carve up the iteration space. For cases like A / L (where L is a lower or upper triangular matrix, and A is an arbitrary strided matrix), there is an axis I can freely partition (rows of A). But what about something like a Cholesky decomposition?

Hmm. @avx actually converts the expression into a call to _avx_!, a generated function.
So, perhaps the approach to take would be to carve up the iteration space, and then repeatedly call @spawn _avx_!(args...). This will handle the unrolling and vectorization, as well as the loops around the unrolled code.
The _avx_! function also returns any reductions (in vector form), so I could then fetch and combine them. This handles the scoping problem I was worried about. Hopefully this is better than writing to a sum buffer and then loading.
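Sketch of the spawn-then-fetch-and-combine pattern being described, with sum standing in for the generated _avx_! kernel:

using Base.Threads
function tsum(x; nchunks = nthreads())
    ranges = Iterators.partition(eachindex(x), cld(length(x), nchunks))
    tasks = map(ranges) do r
        @spawn sum(view(x, r))   # each task reduces its own chunk
    end
    sum(fetch, tasks)            # fetch returns Any, so this final combine is dynamically typed
end

x = rand(10^6); tsum(x) ≈ sum(x)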

Also, FWIW, LoopVectorization currently unrolls up to 3 loops (e.g., it will unroll all three loops for A' * B), and it will unroll both in the 3-arg dot product. You want to unroll the m loop to vectorize it, but unrolling the n loop by U also lets you reuse each x[m] you’ve loaded U times before replacing it.
This is critical to performance, because loading memory is the bottleneck, not arithmetic, even when the memory is aligned and fits in cache.
vfmadd is as fast as an aligned load, and twice as fast as an unaligned load. You already have 1 load from A per vfmadd, hence the importance of minimizing loads from x (loads from y get hoisted out of the loop, so they’re fine).
You also need the multiplications with y. The Julia 1.4+ implementation lifts the multiplication by y[n] out of the loop, but I didn’t see a real benefit when trying that with LoopVectorization, which is much faster over a range of sizes. The arithmetic isn’t the bottleneck.

2 Likes

Thanks for the detailed comments!

I use N loops for CartesianIndices, too (also arrays with IndexCartesian index style in general). Is your concern that halve sounds like chopping the iteration space exactly in half? I agree that would be bad, as it’s not possible to use rectangular loops after halve-ing the array. But actually my definition of halve is “split roughly in half.” (It would have been better if I could use split (as done in Java), but this function name was already taken.) So, halve on x :: AbstractArray{T,4} splits axes(x, 4) in half if size(x, 4) > 1, otherwise splits axes(x, 3) in half if size(x, 3) > 1, and so on (if linear indexing is expensive).
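A sketch of those semantics for an IndexCartesian-style array (just an illustration of the behavior described above, not the actual implementation): split along the last axis whose length is greater than 1, roughly in half, and return two views.

function halve_views(x::AbstractArray{T,N}) where {T,N}
    d = findlast(>(1), collect(size(x)))                         # last splittable axis
    d === nothing && return x, view(x, ntuple(_ -> 1:0, N)...)   # nothing left to split
    mid = size(x, d) ÷ 2
    left  = ntuple(i -> i == d ? (1:mid)             : axes(x, i), N)
    right = ntuple(i -> i == d ? (mid+1:size(x, d))  : axes(x, i), N)
    return view(x, left...), view(x, right...)
end

a, b = halve_views(rand(3, 4, 1, 1))   # splits axis 2: both halves have size (3, 2, 1, 1)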

Yeah, I was aware of the concept, so originally I was hoping I could get it. But the way the scheduler picks up tasks looks random, so I’m not sure if it is working this way. I also guess the task scheduler has to be aware of the CPU topology somehow, but I haven’t heard whether Julia’s task scheduler does that.

At the moment, the main reason I’m using halve as the primary mechanism for parallelism is that it’s so simple and relaxed that I can (in principle) define it for many types of containers: strings, dicts, sets, etc. It also covers IndexCartesian arrays as well as BitArray.

What is the scoping problem?

BTW, the return type of fetch is always Any, so I suppose you are getting the result via the heap. If you want to avoid dynamic dispatch in the downstream code, creating a typed buffer does not sound crazy.

I’m not sure if I understand this part. Do you mean the code would look like roughly

for m0 in 0:Um:size(A,1)-Um, n0 in 0:Un:size(A,2)-Un
    for m1 in 1:Um, n1 in 1:Un # unrolled
        m = m0 + m1
        n = n0 + n1
        s += x[m] * A[m,n] * y[n]
    end
end

when ignoring SIMD and the remainder loop?

This is very interesting; thanks for the explanation. But I still think the implementation in LinearAlgebra.jl makes sense, as y can be a MappedArray from MappedArrays.jl and y[n] can invoke exp or something expensive. In fact, I’d lower the mapping part to (xm, Amn, yn) -> xm * Amn * yn if I were to implement what I described. It should be easy to hoist out y[n] this way (actually, I wrote a proof-of-concept package, NDReducibles.jl, that does this before, so I think it’s relatively easy).
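To illustrate the point about mapped arrays (mappedarray is from MappedArrays.jl; the sizes are arbitrary):

using MappedArrays, LinearAlgebra
y0 = rand(200)
y  = mappedarray(exp, y0)     # y[n] computes exp(y0[n]) on every access; it is not a plain load
x  = rand(200); A = rand(200, 200)
dot(x, A, y)                  # hoisting y[n] per column means exp runs N times instead of M*N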

Hi @elrod, @tkf, @Mason, … and all the other people doing great work on vectorization, multithreading, and combining both for array and broadcasting operations, e.g. a pure-Julia BLAS. Unfortunately, I have not been following very closely due to lack of time recently, but I am certainly very interested in learning more, and in seeing how these tools can be combined with / improve / replace Strided.jl.

Would it be useful to have a videochat at some point to discuss (and possibly coordinate) efforts?

5 Likes

I’d love to do that.

:+1: sounds great.

Base Julia will hopefully start doing that too. You made a lot of comments in that thread, but some readers may not have seen it.

Rounding to the nearest X, where X is determined by the data type and host architecture (e.g., X would be 16 for Float64 and AVX2, because LLVM likes to unroll by 16 with that combination) may lead to better performance.

Look at base Julia and Clang in the dot product benchmarks. LLVM is very slow (compared to the other compilers and LoopVectorization) at handling remainders.
Whenever the length is a multiple of 32, Julia and Clang do well, but performance starts to degrade until it gets pretty bad by the time it’s one less than the next multiple of 32 (32, because AVX512 and Float64).

julia> using LoopVectorization, BenchmarkTools

julia> function mydot(a, b)
       s = zero(eltype(a))
       @inbounds @simd for i ∈ eachindex(a,b)
       s += a[i]*b[i]
       end
       s
       end
mydot (generic function with 1 method)

julia> function mydotavx(a, b)
       s = zero(eltype(a))
       @avx for i ∈ eachindex(a,b)
       s += a[i]*b[i]
       end
       s
       end
mydotavx (generic function with 1 method)

julia> a = rand(95); b = rand(95);

julia> @btime mydot($a, $b)
  26.542 ns (0 allocations: 0 bytes)
23.133416869375647

julia> @btime mydotavx($a, $b)
  8.871 ns (0 allocations: 0 bytes)
23.133416869375644

julia> a = rand(96); b = rand(96);

julia> @btime mydot($a, $b)
  8.167 ns (0 allocations: 0 bytes)
23.368484700717932

julia> @btime mydotavx($a, $b)
  8.661 ns (0 allocations: 0 bytes)
23.368484700717932

That’s around a 3x difference in speed. Trying to be clever about cut points can pay off.

The other (and probably bigger) concern is that it’d be nice if it could tile iteration spaces.
If operating on matrices, it’d be nice if it divided things up into blocks, rather than first dividing axis 2 until its length is 1, and only then starting to divide axis 1.

That doesn’t seem great. Things like that (and the overhead) are why I preferred having relatively granular parallelism.
I confess that most of my experience with parallelism has been at a very macro level, e.g. MCMC chains and/or fitting simulated data sets in parallel with the chains.
Although that can easily bust caches, so finer grained would have its advantages.

BTW,

julia> VectorizationBase.CACHE_SIZE
(32768, 1048576, 25952256)

julia> PaddedMatrices.cache_sizes()
(32768, 1048576, 1441792)

(The L3 is reported differently, because VectorizationBase returns unified, and PaddedMatrices divides it by the number of physical cores.)

You could terminate the recursion once things are small enough to fit in L1 or L2 (probably some fraction of their size), which are local to particular cores.
jmul! from PaddedMatrices packs B into an L3 buffer, and A into an L2 buffer. So I think it’d make sense to spawn separate tasks for the packed A matrices.

L3 is often shared among cores. Exceptions include AMD Zen CPUs (where groups of 4 cores share a slice of L3) and of course multi-socket systems.

I just meant the reduction variables would each be local to their task, so I’d have to find some way to move them. Based on fetch returning Any, I think writing to a pre-allocated buffer sounds like the best solution.

With the understanding that the U_ loops are actually unrolled, it’s more like:

for n0 in 0:Un:size(A,2)-Un
    for n1 in 1:Un # unrolled; each y_n is kept in its own register
        n = n0 + n1
        y_n = y[n]
    end
    for m0 in 0:Um:size(A,1)-Um
        for m1 in 1:Um # unrolled
            m = m0 + m1
            x_m = x[m]
            for n1 in 1:Un # unrolled
                n = n0 + n1
                s += x_m * A[m,n] * y_n
            end
        end
    end
end

To be clear – it tries to minimize the number of loads of x[m] and y[n], so y[n] gets hoisted. What it doesn’t hoist is the multiplication by y[n].

That also addresses your concern regarding MappedArrays, which I should look into supporting. I already have a MappedStridedPointer type, but I didn’t add LoopVectorization support yet – which would involve it trying to recognize what the mapped function is.
If it can’t, it’ll assume the function is expensive.
(A problem here is that it can’t recognize anonymous functions, some of which may be cheap.)

FWIW:

using LinearAlgebra, LoopVectorization, BenchmarkTools
function jdot3avx(x, A, y)
    M, N = size(A)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx for n ∈ 1:N, m ∈ 1:M
        s += x[m] * A[m,n] * y[n]
    end
    s
end
function jdot3v2avx(x, A, y)
    M, N = size(A)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx for n ∈ 1:N
        t = zero(s)
        for m ∈ 1:M
            t += x[m] * A[m,n]
        end
        s += t * y[n]
    end
    s
end

N = 255;
x = rand(N); A = rand(N, N); y = rand(N);
@btime dot($x, $A, $y)
@btime jdot3avx($x, $A, $y)
@btime jdot3v2avx($x, $A, $y)

Seems that this does provide a performance advantage:

julia> N = 255;

julia> x = rand(N); A = rand(N, N); y = rand(N);

julia> @btime dot($x, $A, $y)
  9.784 μs (0 allocations: 0 bytes)
7869.201176634142

julia> @btime jdot3avx($x, $A, $y)
  5.282 μs (0 allocations: 0 bytes)
7869.201176634144

julia> @btime jdot3v2avx($x, $A, $y)
  4.967 μs (0 allocations: 0 bytes)
7869.201176634145

julia> N = 256;

julia> x = rand(N); A = rand(N, N); y = rand(N);

julia> @btime dot($x, $A, $y)
  3.939 μs (0 allocations: 0 bytes)
8400.47728428433

julia> @btime jdot3avx($x, $A, $y)
  3.043 μs (0 allocations: 0 bytes)
8400.477284284334

julia> @btime jdot3v2avx($x, $A, $y)
  2.960 μs (0 allocations: 0 bytes)
8400.477284284336

julia> N = 95;

julia> x = rand(N); A = rand(N, N); y = rand(N);

julia> @btime dot($x, $A, $y)
  2.655 μs (0 allocations: 0 bytes)
1117.8073372797712

julia> @btime jdot3avx($x, $A, $y)
  844.786 ns (0 allocations: 0 bytes)
1117.8073372797714

julia> @btime jdot3v2avx($x, $A, $y)
  795.720 ns (0 allocations: 0 bytes)
1117.8073372797714

julia> N = 96;

julia> x = rand(N); A = rand(N, N); y = rand(N);

julia> @btime dot($x, $A, $y)
  686.953 ns (0 allocations: 0 bytes)
1186.193903110269

julia> @btime jdot3avx($x, $A, $y)
  431.578 ns (0 allocations: 0 bytes)
1186.1939031102693

julia> @btime jdot3v2avx($x, $A, $y)
  415.603 ns (0 allocations: 0 bytes)
1186.1939031102695

CSE-ing the multiplication by y does seem to provide a definite performance benefit.
Getting LoopVectorization to do that automatically would not be trivial. It’d have to understand both distributivity (that a * b + a * c == a * (b + c)) and associativity ((a * b) * c == a * (b * c)) to pull the multiplication by y out of the loop.
Doing that would have definite advantages though. Such optimizations are type-dependent, meaning ideally we’d just write the simplest/most generic version and let LoopVectorization handle the optimization. E.g., if A were transposed, we’d want to hoist multiplication by x instead.
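A scalar illustration of the rewrite in question: hoisting the multiplication by y[n] out of the m loop relies on distributivity over the m-sum, which is exact for reals but only approximate in floating point.

x = rand(4); a = rand(4); yn = 3.0
sum(x[m] * a[m] * yn for m in 1:4) ≈ sum(x[m] * a[m] for m in 1:4) * yn   # true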

julia> N = 255;

julia> x = rand(N); A = rand(N, N)'; y = rand(N);

julia> @btime dot($x, $A, $y)
  9.786 μs (0 allocations: 0 bytes)
7813.512208703809

julia> @btime jdot3avx($x, $A, $y)
  5.271 μs (0 allocations: 0 bytes)
7813.512208703809

julia> @btime jdot3v2avx($x, $A, $y)
  6.089 μs (0 allocations: 0 bytes)
7813.512208703811

julia> N = 256;

julia> x = rand(N); A = rand(N, N)'; y = rand(N);

julia> @btime dot($x, $A, $y)
  4.057 μs (0 allocations: 0 bytes)
8511.806352500511

julia> @btime jdot3avx($x, $A, $y)
  3.070 μs (0 allocations: 0 bytes)
8511.806352500516

julia> @btime jdot3v2avx($x, $A, $y)
  3.155 μs (0 allocations: 0 bytes)
8511.806352500513

julia> N = 95;

julia> x = rand(N); A = rand(N, N)'; y = rand(N);

julia> @btime dot($x, $A, $y)
  2.655 μs (0 allocations: 0 bytes)
1108.9397523730536

julia> @btime jdot3avx($x, $A, $y)
  848.059 ns (0 allocations: 0 bytes)
1108.9397523730527

julia> @btime jdot3v2avx($x, $A, $y)
  901.143 ns (0 allocations: 0 bytes)
1108.9397523730527

julia> N = 96;

julia> x = rand(N); A = rand(N, N)'; y = rand(N);

julia> @btime dot($x, $A, $y)
  693.252 ns (0 allocations: 0 bytes)
1063.8137547704807

julia> @btime jdot3avx($x, $A, $y)
  432.207 ns (0 allocations: 0 bytes)
1063.8137547704805

julia> @btime jdot3v2avx($x, $A, $y)
  432.929 ns (0 allocations: 0 bytes)
1063.8137547704805

The base version achieves that via specialized methods for Adjoint and Transposed arrays that un-transpose and swap x and y.
That’s also effectively what @avx will do when it realizes it’s more efficient to access A’s memory along rows instead of columns (it’ll swap the order of the loops, and vectorize across A’s rows and y instead of along A’s columns and x).

However, the “v2” version that manually hoisted the multiplication by y doesn’t allow as much flexibility in reordering the computation, resulting in its performance degrading slightly when A is transposed. Only slightly, however – performance was still better than the base version.

NDReducibles looks interesting. I’m curious about the planning functionality.
You can see what LoopVectorization’s decisions are via (@avx_debug requires you have objects in scope for it to read their types, hence the need to first define all the symbols that appear in the loops):

julia> x = rand(N); A = rand(N, N); y = rand(N); A′ = A'; s = 0.0; M, N = size(A);

julia> lsA = LoopVectorization.@avx_debug for n ∈ 1:N, m ∈ 1:M
               s += x[m] * A[m,n] * y[n]
           end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:LoopVectorization,:vmul,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000203, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,Symbol("##254"),LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000021, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000021, 0x0000000000000000, 0x0000000000010405, LoopVectorization.compute, 0x00, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000201, 0x0000000000007020),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000002, 0xffffffffffffffb0),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000001, 0xfffffffffffffff0)}
AM = Tuple{0,Tuple{6},Tuple{5},Tuple{},Tuple{},Tuple{},Tuple{}}
LPSYM = Tuple{:n,:m}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x000055d6d0bac000, (96,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007fdfc802ce50, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007fdfc802d1d0, ()), 0.0)

julia> lsA′ = LoopVectorization.@avx_debug for n ∈ 1:N, m ∈ 1:M
               s += x[m] * A′[m,n] * y[n]
           end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:LoopVectorization,:vmul,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000203, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,Symbol("##254"),LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000021, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000021, 0x0000000000000000, 0x0000000000010405, LoopVectorization.compute, 0x00, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:A′,Symbol("##vptr##_A′")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000100),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000001),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}
AM = Tuple{0,Tuple{6},Tuple{5},Tuple{},Tuple{},Tuple{},Tuple{}}
LPSYM = Tuple{:n,:m}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.RowMajorStridedPointer{Float64,1}(Ptr{Float64} @0x000055d6d0bac000, (96,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007fdfc802ce50, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007fdfc802d1d0, ()), 0.0)

Now:

julia> LoopVectorization.choose_order(lsA)
([:n, :m], :m, :n, :m, 4, 6)

julia> LoopVectorization.choose_order(lsA′)
([:m, :n], :n, :m, :n, 4, 6)

The vector gives the order of the loops. The next two symbols give unrolled loops, and the third the vectorized loop. The two integers give unrolling factors (the vectorized loop is also unrolled by the vector width).

Another example would be the matrix multiplication loops for C = A * B when both A and B are transposed. In this case, we have C => (:m, :n), A => (:k, :m), B => (:n, :k).
LoopVectorization decides to loop as though it’s calculating B' * A' (i.e., it untransposes both A and B) and uses expensive scatters to store the results in C; because there are many more reads from A and B than writes to C, it prioritizes efficient reading and lets the writes be slow. That’s the advantage of a holistic approach that also considers the operations performed on the constituent arrays.

It’s interesting to note that this is actually much faster than when only A is transposed and we have C => (:m, :n), A => (:k, :m), B => (:k, :n), where the obvious decision is to vectorize :k.

5 Likes

I’d be very interested in coordinating our efforts. I’d like to ping @tim.holy, who made a few important contributions to LoopVectorization (including adding support for CartesianIndices, and making the choose_order output I just posted intelligible instead of a gensym kludge).

3 Likes

Great. With no lectures the next two weeks (for me at least), we can hopefully find some time that suits everyone.

Regarding blocking strategies, that’s one of the things I have in Strided. And there it even assumes that all the arrays involved in the operation have arbitrary (but strided) layout, i.e. it generalizes beyond the case of having some Transpose or Adjoint objects. However, I’m not sure how good it actually is; I did not really have a comparison, and I am certainly much less experienced with hardware considerations than all of you.

4 Likes

I see. What I wanted to comment was that it might be better not to emphasize only the outermost axis and instead handle all axes, if you want to maximize the possibility of parallelization. Adding a mechanism for this, with the extra constraint of avoiding the remainder loops in all but (at most) one base case, sounds great.

Wouldn’t it make sense to increase the base case size so that the cost of @spawn and fetch is negligible? I’m guessing maybe more than 100 μs is a reasonable target (compared to the ~1 μs overhead of fetch(@spawn nothing))? @btime sum($(randn(256 * 2^10 ÷ sizeof(Float64)))) takes something like 3 μs on my laptop (which has 256 KiB of L2 per core). So I’m wondering if it makes sense to stop using @spawn during the recursion well above the size of L2 (while still using the recursion until the “block” fits in L2 or L1).
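A sketch of that two-threshold idea: keep recursing for cache-sized blocks, but stop spawning new tasks once the remaining work is too small to amortize the @spawn/fetch overhead. Both cutoffs below are made-up illustrative numbers, and 1-based indexing is assumed for brevity.

using Base.Threads
const BLOCK_CUTOFF = 4_096     # assumed: a block that fits comfortably in L1/L2
const SPAWN_CUTOFF = 65_536    # assumed: below this, @spawn/fetch overhead dominates
function rsum(x)
    n = length(x)
    n <= BLOCK_CUTOFF && return sum(x)
    h = n ÷ 2
    a, b = view(x, 1:h), view(x, h+1:n)
    if n > SPAWN_CUTOFF
        t = @spawn rsum(a)             # still worth running the halves in parallel
        return rsum(b) + fetch(t)
    else
        return rsum(a) + rsum(b)       # keep recursing for locality, but stay on this task
    end
end

x = randn(10^6); rsum(x) ≈ sum(x)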

(Though this kind of optimization is totally beyond my experience. So I may be saying something stupid.)

Thanks for the detailed explanation! Just to check my understanding: the first two unrolled loops, for n1 in 1:Un and for m1 in 1:Um, are only for moving the y_n and x_m elements into registers, while the innermost for n1 in 1:Un # unrolled is for cutting the data dependency and making it pipelining-friendly?

Thanks! There is no planning functionality at the moment, though. NDReducibles was mostly meant to be a proof-of-concept for JuliaCon last year to show that we can do a lot with a “functional” API like foreach. The main message was that the author of the custom array/container type knows the best way to iterate over it. Back then I was also thinking it’d be nice to add/steal some ideas from Halide to NDReducibles so that it decouples loop scheduling from the algorithm. A “loop middleware” system like this may be a nice lowering target for something like LoopVectorization. But it’s a bit too big for me to handle at the moment. (That’s why I’m trying to convince Tim Holy that this is a good idea :laughing: in Code generation and cartesian iteration · Issue #9080 · JuliaLang/julia)

Yeah, I agree with this and that’s why I think NDReducibles’ “functional” approach is more suitable as a “middleware” and not as the surface API of automatic optimizations.

I think this particular optimization is actually still doable with the NDReducibles approach, as it knows at the type level which array is written to (it must be a ReferenceableArray from Referenceables.jl).

Of course, something like CSE is impossible and making some decisions based on the loop body is very difficult.

I have a question about switching from @inbounds @simd to @avx. In one of my packages I have many small multidimensional loops (50+). Originally I added @inbounds in front of each loop, and the first execution, with JIT compiling, takes a reasonable amount of time. However, when I switched to @avx, the first execution with JIT compiling takes significantly longer. Is this what I should expect for now?

Yes. The reason @avx is so good is that it builds a variety of possible execution plans and uses heuristics to choose the one that will likely be fastest (similar to what a SQL query planner does). This necessarily involves generating a lot of code and trying a variety of loop orders, et cetera.

4 Likes

Oscar_Smith is correct.

Although, interestingly Tim Holy posted some timings from LoopVectorization 0.7.0:

Test Summary:        | Pass  Total
LoopVectorization.jl | 1091   1091
237.992516 seconds (751.83 M allocations: 33.212 GiB, 2.99% gc time)
ROOT                      :  0.30 %   1889389130
GC                        :  3.02 %   18939286574
LOWERING                  :  5.50 %   34442303902
PARSING                   :  0.04 %   280687890
INFERENCE                 : 19.09 %   119621651748
CODEGEN                   : 10.44 %   65429822168
METHOD_LOOKUP_SLOW        :  0.05 %   305654374
METHOD_LOOKUP_FAST        :  3.53 %   22123410322
LLVM_OPT                  : 51.12 %   320337754472
LLVM_MODULE_FINISH        :  0.43 %   2712265790
LLVM_EMIT                 :  0.01 %   39088412
METHOD_LOOKUP_COMPILE     :  0.00 %   26002236
METHOD_MATCH              :  2.14 %   13440216708
TYPE_CACHE_LOOKUP         :  1.07 %   6735874132
TYPE_CACHE_INSERT         :  0.00 %   12747630
STAGED_FUNCTION           :  0.20 %   1283430378
MACRO_INVOCATION          :  0.02 %   105796960
AST_COMPRESS              :  1.41 %   8848165808
AST_UNCOMPRESS            :  1.45 %   9074990684
SYSIMG_LOAD               :  0.03 %   182878746
ADD_METHOD                :  0.11 %   671121294
LOAD_MODULE               :  0.03 %   162538790
INIT_MODULE               :  0.00 %   6181898

Where about half the time was spent by LLVM.
LoopVectorization is now on v0.7.7, but things probably haven’t changed too much.

We can benchmark how long the optimization step takes; the easy way (not updating based on type information):

julia> q = :(for k in 1:K
                   for n in 1:N
                       # acc = ifelse(keep === nothing, zero(T), c_re[k, n]) # same problem
                       acc = keep === nothing ? zero(T) : c_re[k, n]
                       # acc = zero(T) # this works fine
                       for c in 1:C
                           acc = acc + (a_re[k, n, c] * b_re[c, k] + a_im[k, n, c] * b_im[c, k])
                       end
                       c_re[k, n] = acc
                   end
               end);

julia> ls = LoopVectorization.LoopSet(q);

julia> @btime LoopVectorization.choose_order($ls)
  195.517 μs (6110 allocations: 551.05 KiB)
([:n, :k, :c], :n, :c, :k, 6, 2)

Julia is fast, so thankfully simulating a huge number of options doesn’t take long. :slight_smile:

Actually generating code, however, takes much longer:

julia> @benchmark LoopVectorization.lower($ls)
BenchmarkTools.Trial:
  memory estimate:  2.10 MiB
  allocs estimate:  31930
  --------------
  minimum time:     1.520 ms (0.00% GC)
  median time:      1.632 ms (0.00% GC)
  mean time:        1.933 ms (15.39% GC)
  maximum time:     8.720 ms (79.66% GC)
  --------------
  samples:          2588
  evals/sample:     1

That can probably be improved. Many things are recalculated which should instead be cached, etc.
PRs are of course welcome if anyone wants to take a stab at it. Otherwise, I’ll get to it eventually.
The list of things I’ll get to eventually is long.

One idea that I suspect would speed all of that up is, instead of manually unrolling code – which takes time to do, and then forces Julia to handle it and lower it – create short loops of length equal to unroll count, and add $(Expr(:loopinfo, (Symbol("llvm.loop.unroll.full"), 1))) so that LLVM will do the unrolling.
I wouldn’t be surprised if this also leads to better code generation, e.g. it may better understand the relationships between array indexes/offsets. This is something LLVM routinely does sub-optimally right now, e.g. it will reload a pointer from memory on every iteration of a loop, and have separate pointers in separate registers, when they’re all just at different offsets with respect to one another.
I wouldn’t be surprised if it gets that sort of thing right more often when it unrolls the loops itself.

It’s likely that LoopVectorization will eventually become more integrated with Julia’s compiler.
Here is a PR which lays some of the ground work for packages interacting with the compiler. The goal is for packages to eventually be able to define their own optimization passes. Aside from being useful for LoopVectorization, it is an exciting prospect for the future evolution of AD libraries.

Digression aside, on my computer it takes a little over 4 seconds to run using LoopVectorization and then evaluate a fast @avx function. In comparison, @inbounds and @simd would be more or less instant. A lot of that time is just calling LoopVectorization and its lowering methods themselves; compiling more functions shouldn’t take as much additional time.

But, unfortunately, yes. For now, we cannot expect to have our cake and eat it too.

I haven’t made broadcasting take advantage of this yet, but, let’s define an array type incompatible with LoopVectorization:

struct ArrayIncompatibleWithLoopVectorization{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(A::ArrayIncompatibleWithLoopVectorization) = size(A.data)
Base.@propagate_inbounds Base.getindex(A::ArrayIncompatibleWithLoopVectorization, i::Vararg{Int, N}) where {N} = getindex(A.data, i...)
Base.@propagate_inbounds Base.setindex!(A::ArrayIncompatibleWithLoopVectorization, v, i::Vararg{Int, N}) where {N} = setindex!(A.data, v, i...)
Base.IndexStyle(::Type{<:ArrayIncompatibleWithLoopVectorization}) = IndexLinear()

Presumably, this array’s memory layout is something complicated. Maybe it’s a MultiScaleArray or a SparseArray. Or maybe it’s a very minimal example meant to support the AbstractArray interface in very few lines of code, and we use our imaginations to pretend it’s something more complicated than just an Array hiding underneath.

Now, let us define an array type compatible with LoopVectorization: it is a StridedArray and supports the interface:

struct ArrayCompatibleWithLoopVectorization{T,N} <: DenseArray{T,N}
    data::Array{T,N}
end
Base.size(A::ArrayCompatibleWithLoopVectorization) = size(A.data)
Base.@propagate_inbounds Base.getindex(A::ArrayCompatibleWithLoopVectorization, i::Vararg{Int, N}) where {N} = getindex(A.data, i...)
Base.@propagate_inbounds Base.setindex!(A::ArrayCompatibleWithLoopVectorization, v, i::Vararg{Int, N}) where {N} = setindex!(A.data, v, i...)
Base.IndexStyle(::Type{<:ArrayCompatibleWithLoopVectorization}) = IndexLinear()
Base.unsafe_convert(::Type{Ptr{T}}, A::ArrayCompatibleWithLoopVectorization{T}) where {T} = pointer(A.data)
Base.strides(A::ArrayCompatibleWithLoopVectorization) = strides(A.data)

Now,

using LoopVectorization
xf64 = fill(8.0, 127); xf64[1] = 1e9;
xcf64 = fill(Complex(8.0), 127); xcf64[1] = 1e9;

function selfdot(x)
    s = zero(eltype(x))
    @avx for i in eachindex(x)
        s = muladd(x[i], x[i], s) # Avoids fastmath in fallback loop.
    end
    s
end

Note that muladd is used to avoid @fastmath in the fallback loop. I’m not sold on actually having the @fastmath there. I figured that if users were willing to accept LoopVectorization’s transforms, they’d likely prefer to have LLVM’s if that fails, over none. But here, @fastmath won’t vectorize muladd (although @simd would).

Note the values in those arrays. (1e9)^2 = 1e18, and 1e18 + 8^2 == 1e18, so if we accumulate them with the literal associativity, we’ll just get 1e18 at the end.

Now with base arrays, we get:

julia> selfdot(xf64) - 1e18 # Vectorized
7936.0

julia> selfdot(xcf64) - 1e18 # Not vectorized
0.0 + 0.0im

The complex numbers went through the fallback loop, because they aren’t supported yet. Same applies to quaternions.
The eltype-based dispatch uses

julia> LoopVectorization.check_type(Float64)
true

julia> LoopVectorization.check_type(Complex{Float64})
false

This returns true if the type is in Union{Bool,Base.HWReal}. If you have a custom type and have added all the methods needed for LoopVectorization to work, you can overload it – but that’s a tall order, which I don’t expect any time soon.
The preferred path forward to supporting things like quaternions with more than the fallback loop is the stated compiler integration: run inlining passes until those quaternions are disassembled into operations on Float64 that it does know how to work with.
If that leads nowhere, abort and use the fallback loop.
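For completeness, the opt-in itself would look something like the following, with a purely hypothetical scalar type; the hard part (defining all the vectorized operations LoopVectorization needs for that type) is entirely omitted here.

using LoopVectorization
struct MyFixedPoint <: Real     # hypothetical custom scalar type
    bits::Int64
end
LoopVectorization.check_type(::Type{MyFixedPoint}) = true   # opt in to the eltype check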

Additionally, looking back at our array types:

julia> selfdot(ArrayCompatibleWithLoopVectorization(xf64)) - 1e18
7936.0

julia> selfdot(ArrayIncompatibleWithLoopVectorization(xf64)) - 1e18
0.0

Now, let’s try a few wrappers, mainly Adjoint and SubArray:

julia> selfdot(view(ArrayCompatibleWithLoopVectorization(xf64), :)) - 1e18
7936.0

julia> selfdot(view(ArrayCompatibleWithLoopVectorization(xf64), 1:100)) - 1e18
6272.0

julia> selfdot(view(ArrayCompatibleWithLoopVectorization(xf64), collect(1:length(xf64)))) - 1e18
0.0

julia> selfdot(ArrayCompatibleWithLoopVectorization(xf64)') - 1e18
7936.0

julia> selfdot(view(ArrayCompatibleWithLoopVectorization(xf64), 1:100)') - 1e18
6272.0

julia> selfdot(view(ArrayCompatibleWithLoopVectorization(xf64), collect(1:length(xf64)))') - 1e18
0.0
7 Likes

Just to chime in here: I’m very pleased to report that on my machine, Julia 1.5-DEV (for which the release process started in earnest today) loads LoopVectorization in about 75% of the time of Julia 1.4 (EDIT: the original one-tenth result was contaminated, maybe by PackageCompiler?):

tim@diva:~/src/julia-master$ time julia --startup-file=no -e 'println(VERSION); using LoopVectorization'
1.4.2-pre.0

real	0m0.990s
user	0m1.070s
sys	0m0.212s
tim@diva:~/src/julia-master$ time julia-master --startup-file=no -e 'println(VERSION); using LoopVectorization'
1.5.0-DEV.875

real	0m0.666s
user	0m0.739s
sys	0m0.221s

I’m not yet certain where the difference lies, but a lot of work from many different people went into improving many different areas of Julia’s latencies. Among packages I’ve tested, this is by far the largest improvement I’ve seen so far, and I’m actually having some trouble believing it (more typical is twofold, which is still awesome). It’s really good news even if this particular result has some flaw with it.

Compiling @avxed functions still has a long ways to go, though it too is better:

tim@diva:~/src/julia-master$ time julia --startup-file=no -e 'println(VERSION); include("/tmp/benchlv.jl")'
1.4.2-pre.0

real	0m8.122s
user	0m8.160s
sys	0m0.256s
tim@diva:~/src/julia-master$ time julia-master --startup-file=no -e 'println(VERSION); include("/tmp/benchlv.jl")'
1.5.0-DEV.875

real	0m7.338s
user	0m7.400s
sys	0m0.229s

where benchlv.jl mimics Chris’s example:

using LoopVectorization

function multiple_unrolls_split_depchains_avx!(c_re::AbstractArray{T}, a_re, b_re, a_im, b_im, keep = nothing) where {T}
    @avx for k in 1:2
        for n in 1:2
            # acc = ifelse(keep === nothing, zero(T), c_re[k, n]) # same problem
            acc = keep === nothing ? zero(T) : c_re[k, n]
            # acc = zero(T) # this works fine
            for c in 1:2
                acc = acc + (a_re[k, n, c] * b_re[c, k] + a_im[k, n, c] * b_im[c, k])
            end
            c_re[k, n] = acc
        end
    end
    c_re
end

T = Float32
a_re, a_im = rand(T, 2, 2, 2), rand(T, 2, 2, 2);
b_re, b_im = rand(T, 2, 2), rand(T, 2, 2);
c_re_1 = ones(T, 2, 2); c_re_2 = ones(T, 2, 2);
multiple_unrolls_split_depchains_avx!(c_re_2, a_re, b_re, a_im, b_im, true) # [1 1; 1 1]

More than half the remaining compile time is LLVM, over which we have relatively limited control. So for this case the improvements that Julia itself can make may start entering territory of diminishing returns. Of course, some of that LLVM time is compiling methods in LoopVectorization itself, and one day when it moves into the compiler these will go away.

More broadly, though, we’re not done yet. In the last few days alone there has been a flurry of interest with respect to “invalidations” (where loading new code forces the recompilation of things that have been previously compiled), and over the Julia 1.6 development timeframe there is reason to expect continued improvement in this and other domains. I predict that the barriers to using the remarkable LoopVectorization will continue to fall.

14 Likes

How can we use this with Zygote / Optim for the Fused Gromov Wasserstein distance?

It’s a “meta-distance” between graphs which can help measure how different two molecules are. Antibodies are interesting molecules with >50,000 atoms and 8d features (3D position, mass, charge, atomic number, valence, b-factor).

It would be mega-useful if we could compute a FGW distance between arbitrary macromolecules, because this could be a loss function for neural network potentials and force-field generators in molecular dynamics, which might blow traditional approaches to simulating molecules out of the water and disrupt molecular engineering. I’m a Pythonista and made an OpenAI Gym Env which can load and data-augment (unfold and undock) infinite varieties of protein complexes, small molecules, etc. Neural nets work EXTREMELY well for folding and docking when we know what atoms match what other atoms, and I made super cool GIFs, but I hit a wall when I switched from subtracting distance matrices to FGW (when you have two different molecules you have to figure out which atoms match up).

The problem is, if we want to compare / evolve antibodies or other proteins, and larger complexes, we run out of memory, so I’m looking for solutions to quickly compute FGW distance between massive point clouds (think big, how far can we go?) 100,000x8 point clouds or even 1,000,000, tens, hundreds of millions of atoms, the more the merrier and the faster the better IMO!

Also this could predict side effects of drugs because we could check distances from a given molecule to known toxic molecules, and we could dock therapeutics to patient antibodies to predict immunogenicity.

https://tvayer.github.io/materials/Titouan_Marseille_2019.pdf#page4

Some autodiff support is in the works, albeit very slowly, as I’ve only rarely been able to find the time to dedicate to it. I’ve already been spending way more time on this than the people paying my salary would want. :wink:

However, LoopVectorization only supports strided arrays.
I do not know the details of Fused Gromov Wasserstein. Are these 100,000x8 point clouds dense matrices, or do you have things stored in some sort of graph structure?

If you’re just iterating over a matrix, doing some calculations and trying to get the minimum, then this could help.

But at large sizes, it’s likely your computation will be memory-bound. Thus, avoid passing over your data more than necessary.
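As a rough sketch of the kind of dense, single-pass kernel this could help with (the function is made up; whether it fits your actual FGW formulation is another question): the minimum squared distance from a query point to every row of an npoints × nfeatures matrix.

using LoopVectorization
function min_sqdist(P::AbstractMatrix{T}, q::AbstractVector{T}) where {T}
    best = typemax(T)
    @avx for i in axes(P, 1)
        d = zero(T)
        for j in axes(P, 2)
            t = P[i, j] - q[j]
            d += t * t
        end
        best = min(best, d)   # min reduction across rows
    end
    best
end

P = rand(100_000, 8); q = rand(8)
min_sqdist(P, q)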

1 Like