Parallel matrix col/row operations give incorrect results

@stevengj Thank you for the explanation.

I don’t think the threads update different columns: j2 in add_col! is i1 in add_colrow!, which is the same k in mod_col! in all the threads — so every thread ends up writing to the same column.
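To make the point concrete, here is a minimal sketch of the thread-safe pattern: each thread must own a distinct destination column, with the shared column only read. The add_col! here is a simplified stand-in for the functions discussed above, not the original code.

```julia
using Base.Threads, LinearAlgebra

# add_col!(X, j, k, c): add c * column k into column j.
# Safe under @threads as long as every thread gets a different j
# and column k itself is only read, never written.
function add_col!(X, j, k, c)
    @inbounds for i in axes(X, 1)
        X[i, j] += c * X[i, k]
    end
end

X = Matrix{Float64}(I, 4, 4)
k = 1
@threads for j in 2:size(X, 2)   # each thread writes its own column j ≠ k
    add_col!(X, j, k, 2.0)
end
```

If instead all threads wrote to the same column (the j2 == i1 == k situation above), the updates would race and the result would be wrong.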

This also answers question (2) from my earlier post: https://discourse.julialang.org/t/how-to-improve-efficiency-of-my-linear-algebra-code-matrix-equivalency/

Here are my timings for LU decompositions:

```julia
julia> using LinearAlgebra
julia> T=Float64; n=10^4;   @time X=rand(T,n,n);
  0.373731 seconds (2 allocations: 762.939 MiB)
julia> @time L,U,p = lu(X);   X[p,:] ≈ L*U
  7.881360 seconds (14 allocations: 2.235 GiB, 0.96% gc time)
true
```

```julia
julia> using AbstractAlgebra
julia> T=AbstractAlgebra.GF(2); n=10^4;   @time X=AbstractAlgebra.matrix_space(T,n,n)(rand(0:1,n,n));
  3.527718 seconds (6 allocations: 2.235 GiB)
julia> @time r,d,p,L,U = fflu(X);   p = invperm([p[i] for i=1:n]);   X[p,:] == L*U
47072.787102 seconds (30 allocations: 2.980 GiB, 0.00% gc time)
true
```

```julia
julia> using Nemo
julia> T=Nemo.GF(2); n=10^4;   @time X=Nemo.matrix(T,n,n,rand(0:1,n,n));
 41.483070 seconds (200.00 M allocations: 5.215 GiB, 31.76% gc time)
julia> @time r,p,L,U = Nemo.lu(X);   p = invperm([p[i] for i=1:n]);   X[p,:] == L*U
 15.461073 seconds (10.16 k allocations: 4.183 GiB, 0.02% gc time)
true
```

It seems Flint (via Nemo.jl) is very performant, though still twice as slow as LAPACK :cry: and it causes many unnecessary allocations.

I’m wondering how much work it would be for me to learn to integrate Flint into my Julia library, so that it works on an ordinary Matrix instead of Nemo.matrix and lu! causes 0 allocations. What would be the best place to learn this — studying the Nemo.jl code? Is there a tutorial on integrating C libraries into Julia?
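For reference, the general mechanism is Julia's ccall (documented in the manual under "Calling C and Fortran Code"): an ordinary Matrix is passed to C as a raw pointer to its contiguous column-major data, with no copying or allocation. Here memcpy from libc stands in for a Flint routine — the actual Flint function names and signatures would need to be looked up in flint.h.

```julia
# Hedged sketch: calling a C function on an ordinary Julia Matrix via ccall.
# The Matrix auto-converts to a pointer to its (column-major) data buffer,
# so the call itself performs zero allocations.
src = rand(3, 3)
dst = zeros(3, 3)
GC.@preserve src dst begin   # keep the arrays rooted while C holds pointers
    ccall(:memcpy, Ptr{Cvoid},
          (Ptr{Float64}, Ptr{Float64}, Csize_t),
          dst, src, sizeof(src))
end
dst == src  # true
```

Wrapping Flint the way Nemo.jl does would follow the same pattern, just with Flint's own matrix struct layout and a `Libdl`-loaded libflint instead of libc.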

Also, is it possible to use Tullio.jl to speed up my naive implementation of lu!? I know that Tullio.@tullio can speed up matrix multiplication over custom types.
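For context, the Tullio matrix-multiplication pattern referred to above looks like this (assuming Tullio.jl is installed; the summation over the repeated index k is implicit):

```julia
using Tullio

A = rand(4, 4)
B = rand(4, 4)

# := creates C; the repeated index k on the right is summed over,
# so this is C[i,j] = Σₖ A[i,k] * B[k,j], i.e. C = A * B.
@tullio C[i, j] := A[i, k] * B[k, j]
```

Whether this helps inside lu! is a separate question, since Gaussian elimination has sequential pivoting dependencies that this kind of index-notation kernel doesn't express directly.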