Parallel matrix col/row operations give incorrect results

When doing Gaussian elimination in parallel, I noticed incorrect results when I pass tuples of several matrices:

using LinearAlgebra

function add_col!(X::Matrix{tw}, w::tw, j1::Int, j2::Int)  where {tw}   # col j2 of X += w * col j1
    for i=1:size(X,1)  X[i,j2] += w*X[i,j1] end end
function add_row!(X::Matrix{tw}, w::tw, i1::Int, i2::Int)  where {tw}   # row i2 of X += w * row i1
    for j=1:size(X,2)  X[i2,j] += w*X[i1,j] end end
function add_colrow!(X::Tuple{Vararg{Matrix{tw}}}, Y::Tuple{Vararg{Matrix{tw}}}, w::tw, i1::Int, i2::Int)  where {tw}
    #= Given many composable matrices R^l <--X-- R^m <--Y-- R^n (where l,n vary but m is fixed), 
    we replace the basis element e_i1 ∈ R^m with the linear combination e_i1-w*e_i2. =#
    for t=1:length(Y)  add_row!(Y[t],  w, i1, i2) end 
    for t=1:length(X)  add_col!(X[t], -w, i2, i1) end end
function mod_col!(D::Matrix{tw}, U::Matrix{tw}, k::Int; prl::Int=1)  where {tw}
    # eliminate below the pivot D[k,k], recording the basis change in U
    if     prl==0                  for i=k+1:size(D,1)   add_colrow!((U,), (D,), -div(D[i,k],D[k,k]), k, i) end 
    elseif prl==1 Threads.@threads for i=k+1:size(D,1)   add_colrow!((U,), (D,), -div(D[i,k],D[k,k]), k, i) end  
    else error("prl=$prl is not supported.") end end

begin m=n=300;  
    A1=rand(Int.(-9:9),m,n);  A1[1,1]=3; A2=copy(A1);   # fix a nonzero pivot at (1,1)
    U1,U2 = (Matrix{Int}(I,d,d) for d ∈ (m,m));
    mod_col!(A1, U1, 1, prl=0);   # unparallelized execution
    mod_col!(A2, U2, 1, prl=1);   # parallel execution gives different result
    println(U1==U2) 
end
false

I don’t see how this could create any data race, since distinct threads edit different cols/rows. Any ideas why this occurs? Is this perhaps a bug in Base.Threads?

Note that if in mod_col! I instead used add_colrow!((), (D,U), ...) then I would get correct results. The buggy output appears only when I edit rows of D and cols of U in parallel.

I don’t think you update different columns in the threads. j2 in add_col! is i1 in add_colrow!, which is the same k in mod_col! in all the threads.
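One way to remove the race, as a sketch reusing your helpers: precompute the multipliers, parallelize only the row updates on D (each thread then writes only its own row i), and keep the column updates on U serial, since every iteration writes to the same column k. The name mod_col_serial_U! is just for illustration; it should reproduce your prl=0 result.

function mod_col_serial_U!(D::Matrix{tw}, U::Matrix{tw}, k::Int)  where {tw}
    ws = [-div(D[i,k],D[k,k]) for i=k+1:size(D,1)]   # multipliers; rows i>k are still unmodified here
    Threads.@threads for i=k+1:size(D,1)             # race-free: each thread writes only row i of D
        add_row!(D, ws[i-k], k, i) end
    for i=k+1:size(D,1)                              # every iteration writes column k of U: keep serial
        add_col!(U, -ws[i-k], i, k) end end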

Is there a reason you are implementing this yourself rather than using the lu function from LinearAlgebra? As a learning exercise?

It’s pretty hard to beat the performance of the LAPACK routine (which is already multithreaded).
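For reference, basic usage is just

using LinearAlgebra
A = rand(1000, 1000)
F = lu(A)                  # dispatches to multithreaded LAPACK for Float64 matrices
F.L * F.U ≈ A[F.p, :]      # true, up to roundoff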

@sgaure Argh, you’re right, thank you.

@stevengj I know BLAS/LAPACK is ultra fast. Now that I see how quickly a data race can sneak in, I’m impressed that those two libraries avoid all such pitfalls.

However, I am implementing a multithreaded Smith normal form, which to my knowledge does not yet exist in Julia.

Also, since you mentioned it, I am really missing a method of lu! that would accept a matrix over a finite field (or even just a prime field \mathbb{Z}/p, especially \mathbb{Z}/2) and be as fast as LAPACK over floats.

Has FLINT (github.com/flintlib/flint, the Fast Library for Number Theory) or FFLAS-FFPACK (github.com/linbox-team/fflas-ffpack, Finite Field Linear Algebra Subroutines/Package) been integrated into Julia yet? I tried

using AbstractAlgebra
T=GF(2); n=10^4;   X=rand(T.(0:1),n,n);   Y=matrix_space(T,n,n)(X);
@time r,d,p,L,U = fflu(Y);

but this is much, much slower than LinearAlgebra.lu!(X) over floats, even though + and * are much simpler over GF(2) than over Float64.

I don’t know about FFPACK, but I believe FLINT is integrated in Nemo.jl (github.com/Nemocas/Nemo.jl, Julia bindings for various mathematical libraries, including flint2), which is pretty actively maintained.

Usually, textbook three-loop implementations of linear algebra (from matrix multiplication to LU) are close to 100x slower than optimized LAPACK/BLAS-style methods for large matrices, even single-threaded. First, you have to “block” the algorithms in some fashion to access memory efficiently (since memory access is otherwise far slower than arithmetic). Then, at the lowest levels you have to use registers and SIMD operations well.
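To give a rough flavor of the blocking point, here is a toy sketch (nowhere near a real BLAS kernel) comparing the textbook three-loop multiply with a crudely tiled variant that touches A, B, and C one cache-sized block at a time:

function matmul_naive!(C, A, B)
    m, l, n = size(A,1), size(A,2), size(B,2)
    for i=1:m, j=1:n
        s = zero(eltype(C))
        for t=1:l  s += A[i,t]*B[t,j] end   # textbook inner product per entry
        C[i,j] = s end
    return C end

function matmul_tiled!(C, A, B; bs=64)
    m, l, n = size(A,1), size(A,2), size(B,2)
    fill!(C, zero(eltype(C)))
    for jb=1:bs:n, tb=1:bs:l, ib=1:bs:m   # visit bs×bs tiles so the operands stay in cache
        for j=jb:min(jb+bs-1,n), t=tb:min(tb+bs-1,l), i=ib:min(ib+bs-1,m)
            @inbounds C[i,j] += A[i,t]*B[t,j] end end   # innermost index i is contiguous (column-major)
    return C end

Even this crude tiling is usually several times faster than the naive loops for large matrices; real kernels add packing and SIMD microkernels on top of it.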

So even if you start off with faster arithmetic, your baseline three-loop algorithms will still be far worse. But GF(2) arithmetic in AbstractAlgebra.jl is probably not faster than hardware-accelerated Float64 arithmetic anyway, because the implementation of GF(2) operations seems to use generic code for any GF(p), not specialized for p=2. I don’t know if there is a faster implementation of GF(2) somewhere?
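(To illustrate the kind of specialization I mean: a bit-packed GF(2) representation stores 64 field elements per UInt64, so a single xor adds 64 elements at once, e.g.

add_gf2_rows!(a::Vector{UInt64}, b::Vector{UInt64}) = (a .⊻= b; a)   # 64 GF(2) additions per machine word

and a dedicated GF(2) package can build its row operations on tricks like this.)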

With a lot of CAS-like systems, my guess is that the linear algebra is written with the assumption that arithmetic is expensive (e.g. arbitrary-precision arithmetic), so the textbook three-loop implementations are fine (the algorithms are limited by arithmetic, not memory).


@stevengj Thank you for the explanation.

I don’t think you update different columns in the threads. j2 in add_col! is i1 in add_colrow!, which is the same k in mod_col! in all the threads.

This also solves my question (2) here: https://discourse.julialang.org/t/how-to-improve-efficiency-of-my-linear-algebra-code-matrix-equivalency/

Here are my timings for lu decompositions:

julia> using LinearAlgebra
julia> T=Float64; n=10^4;   @time X=rand(T,n,n);
  0.373731 seconds (2 allocations: 762.939 MiB)
julia> @time L,U,p = lu(X);   X[p,:] ≈ L*U
  7.881360 seconds (14 allocations: 2.235 GiB, 0.96% gc time)
true

julia> using AbstractAlgebra
julia> T=AbstractAlgebra.GF(2); n=10^4;   @time X=AbstractAlgebra.matrix_space(T,n,n)(rand(0:1,n,n));
  3.527718 seconds (6 allocations: 2.235 GiB)
julia> @time r,d,p,L,U = fflu(X);   p = invperm([p[i] for i=1:n]);   X[p,:] == L*U
47072.787102 seconds (30 allocations: 2.980 GiB, 0.00% gc time)
true

julia> using Nemo
julia> T=Nemo.GF(2); n=10^4;   @time X=Nemo.matrix(T,n,n,rand(0:1,n,n));
 41.483070 seconds (200.00 M allocations: 5.215 GiB, 31.76% gc time)
julia> @time r,p,L,U = Nemo.lu(X);   p = invperm([p[i] for i=1:n]);   X[p,:] == L*U
 15.461073 seconds (10.16 k allocations: 4.183 GiB, 0.02% gc time)
true

Seems like Flint via Nemo.jl is very performant, though still twice as slow as LAPACK :cry: and with many seemingly unnecessary allocations.

I’m wondering, how much work would it be for me to learn how to integrate Flint into my Julia library, so that it would work on an ordinary Matrix instead of Nemo.matrix and so that lu! would cause 0 allocations? What would be the best place to learn this? Studying the Nemo.jl code? Is there a tutorial for integrating C libraries into Julia?

Also, is it possible to use Tullio.jl to speed up my naive implementation of lu!? I know that Tullio.@tullio can be used to speed up matrix multiplication over custom types.
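For concreteness, the matrix-multiplication pattern I have in mind is something like

using Tullio
@tullio C[i,j] := A[i,k] * B[k,j]   # implicit sum over the repeated index k, threaded automatically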

It cannot be done with zero allocations, because you need to convert your Matrix to a format that flint can work with. This is exactly what matrix(...) is doing.
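As for a tutorial on calling C libraries in general: the mechanism is Julia’s built-in ccall/@ccall, documented in the “Calling C and Fortran Code” chapter of the Julia manual, whose minimal example is

t = ccall(:clock, Int32, ())   # call int clock(void) from the C standard library

Wrapping flint specifically means mirroring its C structs in Julia and passing pointers around, which is essentially what Nemo.jl does internally, so its source is a good place to look.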

A few remarks:

  1. There is a faster implementation available in Nemo:
julia> T=Nemo.GF(2); n=10^4; X=Nemo.matrix(T,n,n,rand(0:1,n,n));

julia> @time r,p,L,U = Nemo.lu(X);
 15.541413 seconds (150.01 M allocations: 15.358 GiB, 2.01% gc time)

julia> T=Nemo.Native.GF(2); n=10^4; X=Nemo.matrix(T,n,n,rand(0:1,n,n));

julia> @time r,p,L,U = Nemo.lu(X);
  8.088674 seconds (10.16 k allocations: 4.183 GiB, 0.62% gc time)
  2. There is an in-place version:
julia> T=Nemo.Native.GF(2); n=10^4; X=Nemo.matrix(T,n,n,rand(0:1,n,n));

julia> p = one(SymmetricGroup(n));

julia> @time lu!(p, X);
  6.175562 seconds (10.14 k allocations: 2.692 GiB, 0.75% gc time)
  3. You might want to have a look at Modulo2.jl (github.com/matthias314/Modulo2.jl), a Julia package to work with integers mod 2, including arrays, which computes echelon forms pretty fast (CC: @matthias314)

Wow, those timings are impressive: finally something faster than over floats. I appreciate the info!