Parallel matrix col/row operations give incorrect results

When doing Gaussian elimination in parallel, I noticed incorrect results when I pass tuples of several matrices:

using LinearAlgebra

function add_col!(X::Matrix{tw}, w::tw, j1::Int, j2::Int)  where {tw}   # col j2 of X += w * col j1
    for i=1:size(X,1)  X[i,j2] += w*X[i,j1] end end
function add_row!(X::Matrix{tw}, w::tw, i1::Int, i2::Int)  where {tw}   # row i2 of X += w * row i1
    for j=1:size(X,2)  X[i2,j] += w*X[i1,j] end end
function add_colrow!(X::Tuple{Vararg{Matrix{tw}}}, Y::Tuple{Vararg{Matrix{tw}}}, w::tw, i1::Int, i2::Int)  where {tw}
    #= Given many composable matrices R^l <--X-- R^m <--Y-- R^n (where l,n vary but m is fixed), 
    we replace the basis element e_i1 ∈ R^m with the linear combination e_i1-w*e_i2. =#
    for t=1:length(Y)  add_row!(Y[t],  w, i1, i2) end 
    for t=1:length(X)  add_col!(X[t], -w, i2, i1) end end
function mod_col!(D::Matrix{tw}, U::Matrix{tw}, k::Int; prl::Int=1)  where {tw}
    # eliminate below the pivot D[k,k], recording the basis change in U
    if     prl==0                  for i=k+1:size(D,1)   add_colrow!((U,), (D,), -div(D[i,k],D[k,k]), k, i) end 
    elseif prl==1 Threads.@threads for i=k+1:size(D,1)   add_colrow!((U,), (D,), -div(D[i,k],D[k,k]), k, i) end  
    else error("prl=$prl is not supported.") end end

begin m=n=300;  
    A1=rand(Int.(-9:9),m,n);  A1[1,1]=3; A2=copy(A1);   # fix a nonzero pivot at (1,1)
    U1,U2 = (Matrix{Int}(I,d,d) for d ∈ (m,m));
    mod_col!(A1, U1, 1, prl=0);   # unparallelized execution
    mod_col!(A2, U2, 1, prl=1);   # parallel execution gives different result
    println(U1==U2) 
end
false

I don’t see how this could create any data race, since distinct threads edit different cols/rows. Any ideas why this occurs? Is this perhaps a bug in Base.Threads?

Note that if in mod_col! I instead used add_colrow!((), (D,U), ...) then I would get correct results. The buggy output appears only when I edit rows of D and cols of U in parallel.

I don’t think you update different columns in the threads. j2 in add_col! is i1 in add_colrow!, which is the same k in mod_col! in all the threads.
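One way to remove the race, as a sketch reusing your helpers: precompute the multipliers, parallelize only the row updates on D (each thread then writes only its own row i), and keep the column updates on U serial, since every iteration writes to the same column k. The name mod_col_serial_U! is just for illustration; it should reproduce your prl=0 result.

function mod_col_serial_U!(D::Matrix{tw}, U::Matrix{tw}, k::Int)  where {tw}
    ws = [-div(D[i,k],D[k,k]) for i=k+1:size(D,1)]   # multipliers; rows i>k are still unmodified here
    Threads.@threads for i=k+1:size(D,1)             # race-free: each thread writes only row i of D
        add_row!(D, ws[i-k], k, i) end
    for i=k+1:size(D,1)                              # every iteration writes column k of U: keep serial
        add_col!(U, -ws[i-k], i, k) end end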

Is there a reason you are implementing this yourself rather than using the lu function from LinearAlgebra? As a learning exercise?

It’s pretty hard to beat the performance of the LAPACK routine (which is already multithreaded).
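For reference, basic usage is just

using LinearAlgebra
A = rand(1000, 1000)
F = lu(A)                  # dispatches to multithreaded LAPACK for Float64 matrices
F.L * F.U ≈ A[F.p, :]      # true, up to roundoff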

@sgaure Argh, you’re right, thank you.

@stevengj I know BLAS/LAPACK is ultra fast. Now that I see how quickly a data race can sneak in, I’m impressed that those two libraries avoid all such pitfalls.

However, I am implementing a multithreaded Smith normal form, which to my knowledge does not yet exist in Julia.

Also, since you mentioned it, I am really missing a method of lu! that would accept a matrix over a finite field (or even just a prime field \mathbb{Z}/p, especially \mathbb{Z}/2) and be as fast as LAPACK over floats.

Has FLINT (github.com/flintlib/flint, the Fast Library for Number Theory) or FFLAS-FFPACK (github.com/linbox-team/fflas-ffpack, Finite Field Linear Algebra Subroutines/Package) been integrated into Julia yet? I tried

using AbstractAlgebra
T=GF(2); n=10^4;   X=rand(T.(0:1),n,n);   Y=matrix_space(T,n,n)(X);
@time r,d,p,L,U = fflu(Y);

but this is much, much slower than LinearAlgebra.lu!(X) over floats, even though + and * are much simpler over GF(2) than over Float64.

I don’t know about FFPACK, but I believe FLINT is integrated in Nemo.jl (github.com/Nemocas/Nemo.jl, Julia bindings for various mathematical libraries, including flint2), which is pretty actively maintained.

Usually, textbook three-loop implementations of linear algebra (from matrix multiplication to LU) are close to 100x slower than optimized LAPACK/BLAS-style methods for large matrices, even single-threaded. First, you have to “block” the algorithms in some fashion to access memory efficiently (since memory access is otherwise far slower than arithmetic). Then, at the lowest levels you have to use registers and SIMD operations well.
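To give a rough flavor of the blocking point, here is a toy sketch (nowhere near a real BLAS kernel) comparing the textbook three-loop multiply with a crudely tiled variant that touches A, B, and C one cache-sized block at a time:

function matmul_naive!(C, A, B)
    m, l, n = size(A,1), size(A,2), size(B,2)
    for i=1:m, j=1:n
        s = zero(eltype(C))
        for t=1:l  s += A[i,t]*B[t,j] end   # textbook inner product per entry
        C[i,j] = s end
    return C end

function matmul_tiled!(C, A, B; bs=64)
    m, l, n = size(A,1), size(A,2), size(B,2)
    fill!(C, zero(eltype(C)))
    for jb=1:bs:n, tb=1:bs:l, ib=1:bs:m   # visit bs×bs tiles so the operands stay in cache
        for j=jb:min(jb+bs-1,n), t=tb:min(tb+bs-1,l), i=ib:min(ib+bs-1,m)
            @inbounds C[i,j] += A[i,t]*B[t,j] end end   # innermost index i is contiguous (column-major)
    return C end

Even this crude tiling is usually several times faster than the naive loops for large matrices; real kernels add packing and SIMD microkernels on top of it.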

So even if you start off with faster arithmetic, your baseline three-loop algorithms will still be far worse. But GF(2) arithmetic in AbstractAlgebra.jl is probably not faster than hardware-accelerated Float64 arithmetic anyway, because the implementation of GF(2) operations seems to use generic code for any GF(p), not specialized for p=2. I don’t know if there is a faster implementation of GF(2) somewhere?
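(To illustrate the kind of specialization I mean: a bit-packed GF(2) representation stores 64 field elements per UInt64, so a single xor adds 64 elements at once, e.g.

add_gf2_rows!(a::Vector{UInt64}, b::Vector{UInt64}) = (a .⊻= b; a)   # 64 GF(2) additions per machine word

and a dedicated GF(2) package can build its row operations on tricks like this.)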

With a lot of CAS-like systems, my guess is that the linear algebra is written with the assumption that arithmetic is expensive (e.g. arbitrary-precision arithmetic), so the textbook three-loop implementations are fine (the algorithms are limited by arithmetic, not memory).


@stevengj Thank you for the explanation.

I don’t think you update different columns in the threads. j2 in add_col! is i1 in add_colrow!, which is the same k in mod_col! in all the threads.

This also solves my question (2) here: https://discourse.julialang.org/t/how-to-improve-efficiency-of-my-linear-algebra-code-matrix-equivalency/

Here are my timings for lu decompositions:

julia> using LinearAlgebra
julia> T=Float64; n=10^4;   @time X=rand(T,n,n);
  0.373731 seconds (2 allocations: 762.939 MiB)
julia> @time L,U,p = lu(X);   X[p,:] ≈ L*U
  7.881360 seconds (14 allocations: 2.235 GiB, 0.96% gc time)
true

julia> using AbstractAlgebra
julia> T=AbstractAlgebra.GF(2); n=10^4;   @time X=AbstractAlgebra.matrix_space(T,n,n)(rand(0:1,n,n));
  3.527718 seconds (6 allocations: 2.235 GiB)
julia> @time r,d,p,L,U = fflu(X);   p = invperm([p[i] for i=1:n]);   X[p,:] == L*U
47072.787102 seconds (30 allocations: 2.980 GiB, 0.00% gc time)
true

julia> using Nemo
julia> T=Nemo.GF(2); n=10^4;   @time X=Nemo.matrix(T,n,n,rand(0:1,n,n));
 41.483070 seconds (200.00 M allocations: 5.215 GiB, 31.76% gc time)
julia> @time r,p,L,U = Nemo.lu(X);   p = invperm([p[i] for i=1:n]);   X[p,:] == L*U
 15.461073 seconds (10.16 k allocations: 4.183 GiB, 0.02% gc time)
true

Seems like Flint via Nemo.jl is very performant, though still twice as slow as LAPACK :cry: and with many seemingly unnecessary allocations.

I’m wondering, how much work would it be for me to learn how to integrate Flint into my Julia library, so that it would work on an ordinary Matrix instead of Nemo.matrix and so that lu! would cause 0 allocations? What would be the best place to learn this? Studying the Nemo.jl code? Is there a tutorial for integrating C libraries into Julia?

Also, is it possible to use Tullio.jl to speed up my naive implementation of lu!? I know that Tullio.@tullio can be used to speed up matrix multiplication over custom types.
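For concreteness, the matrix-multiplication pattern I have in mind is something like

using Tullio
@tullio C[i,j] := A[i,k] * B[k,j]   # implicit sum over the repeated index k, threaded automatically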

It cannot be done with zero allocations, because you need to convert your Matrix to a format that flint can work with. This is exactly what matrix(...) is doing.
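As for a tutorial on calling C libraries in general: the mechanism is Julia’s built-in ccall/@ccall, documented in the “Calling C and Fortran Code” chapter of the Julia manual, whose minimal example is

t = ccall(:clock, Int32, ())   # call int clock(void) from the C standard library

Wrapping flint specifically means mirroring its C structs in Julia and passing pointers around, which is essentially what Nemo.jl does internally, so its source is a good place to look.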

A few remarks:

  1. There is a faster implementation available in Nemo:
julia> T=Nemo.GF(2); n=10^4; X=Nemo.matrix(T,n,n,rand(0:1,n,n));

julia> @time r,p,L,U = Nemo.lu(X);
 15.541413 seconds (150.01 M allocations: 15.358 GiB, 2.01% gc time)

julia> T=Nemo.Native.GF(2); n=10^4; X=Nemo.matrix(T,n,n,rand(0:1,n,n));

julia> @time r,p,L,U = Nemo.lu(X);
  8.088674 seconds (10.16 k allocations: 4.183 GiB, 0.62% gc time)
  2. There is an in-place version:
julia> T=Nemo.Native.GF(2); n=10^4; X=Nemo.matrix(T,n,n,rand(0:1,n,n));

julia> p = one(SymmetricGroup(n));

julia> @time lu!(p, X);
  6.175562 seconds (10.14 k allocations: 2.692 GiB, 0.75% gc time)
  3. You might want to have a look at Modulo2.jl (github.com/matthias314/Modulo2.jl), a Julia package to work with integers mod 2, including arrays, which computes echelon forms pretty fast (CC: @matthias314)

Wow, those timings are impressive: finally something faster than over floats. I appreciate the info!