How to improve efficiency of my linear algebra code (matrix equivalency)

I wish to write a function decompose_diag_over_field! that, given an mxn matrix X over a field, decomposes it in-place as X=U*D*V, where D is a diagonal mxn matrix and U,V are invertible mxm and nxn matrices, with inverses Ui,Vi.

I don’t need U,V to be unitary, like in singular value decomposition. I don’t need D[i,i] to divide D[i+1,i+1], like in Smith normal form. I’m just aiming for efficient computation of matrix equivalence.

I would like the type of weights/entries tw to be generic (representing an arbitrary field), assuming only that functions +, -, *, inv, zero, one and pivot selection are defined for that type.

My current solution:

using LinearAlgebra

function swap_col!(X ::Matrix{tw}, j1 ::Int, j2 ::Int) ::Nothing  where {tw}
@inbounds for i = 1:size(X,1)
X[i,j1], X[i,j2] = X[i,j2], X[i,j1];
end
return nothing
end
function swap_row!(X ::Matrix{tw}, i1 ::Int, i2 ::Int) ::Nothing  where {tw}
@inbounds for j = 1:size(X,2)
X[i1,j], X[i2,j] = X[i2,j], X[i1,j];
end
return nothing
end
function swap_colrow!(X ::Tuple{Vararg{Matrix{tw}}}, Y ::Tuple{Vararg{Matrix{tw}}}, v1 ::Int, v2 ::Int) ::Nothing  where {tw}
#= Given composable matrices (R-linear maps) R^l <--X-- R^m <--Y-- R^n (several of them, where l,n can vary but m is fixed),
we replace basis elements e_v1,e_v2 ∈ R^m with e_v2,e_v1 and adjust all X,Y accordingly (swap v1,v2-columns in Xi and v1,v2-rows in Yj). =#
for t = 1:length(X)
swap_col!(X[t],v1,v2)
end
for t = 1:length(Y)
swap_row!(Y[t],v1,v2)
end
return nothing
end

function add_col!(X ::Matrix{tw}, a ::tw, j1::Int, j2::Int) ::Nothing  where {tw}
@inbounds @simd for i = 1:size(X,1)
X[i,j2] += a*X[i,j1]
end
return nothing
end
function add_row!(X ::Matrix{tw}, a ::tw, i1 ::Int, i2 ::Int) ::Nothing  where {tw}
@inbounds @simd for j = 1:size(X,2)
X[i2,j] += a*X[i1,j]
end
return nothing
end
function add_col!(X ::Tuple{Vararg{Matrix{tw}}}, Y ::Tuple{Vararg{Matrix{tw}}}, a ::tw, j1 ::Int, j2 ::Int) ::Nothing  where {tw}
#= Given many composable matrices R^l <--X-- R^m <--Y-- R^n (where l,n can vary but m is fixed),
we replace the basis element e_j2 ∈ R^m with the linear combination a*e_i1 + e_i2. =#
@inbounds for t = 1:length(X)
add_col!(X[t],  a, j1, j2)
end
@inbounds for t = 1:length(Y)
add_row!(Y[t], -a, j2, j1)
end
return nothing
end
function add_row!(X ::Tuple{Vararg{Matrix{tw}}}, Y ::Tuple{Vararg{Matrix{tw}}}, a ::tw, i1 ::Int, i2 ::Int) ::Nothing  where {tw}
#= Given many composable matrices (R^l <--X-- R^m <--Y-- R^n (where l,n can vary but m is fixed),
we replace the basis element e_i1 ∈ R^m with the linear combination e_i1 - a*e_i2. =#
@inbounds for t = 1:length(Y)
add_row!(Y[t], a, i1, i2)
end
@inbounds for t = 1:length(X)
end
return nothing
end

function elim_col_over_field!(X ::Matrix{tw}, U ::Matrix{tw}, Ui ::Matrix{tw}, i ::Int, j ::Int; prl ::Int =1) ::Nothing  where tw
#= After applying this function, the only nonzero entry in j-th column of the altered matrix is X[i,j]. Throughout, U*X is preserved and U*Ui=I. =#
x ::tw = inv(X[i,j] ::tw)
_0, _1 = zero(tw), one(tw)
@assert !iszero(X[i,j])
if     prl==0
for i_ = i+1:size(X,1)
X[i_,j]==_0 && continue;
add_row!((U,), (X,Ui), -X[i_,j]*x, i, i_);
X[i_,j] = _0
end
elseif prl==1
X[i_,j] == _0 && continue
add_row!((U,), (X,Ui), -X[i_,j]*x, i, i_)
X[i_,j] = _0
end
else error("prl=$prl is not supported") end return nothing end function elim_row_over_field!(X ::Matrix{tw}, V ::Matrix{tw}, Vi ::Matrix{tw}, i ::Int, j ::Int; prl ::Int =1) ::Nothing where tw #= After applying this function, the only nonzero entry in i-th row of the altered matrix is X[i,j]. Throughout, X*V is preserved and V*Vi=I. =# x ::tw = inv(X[i,j] ::tw) _0, _1 = zero(tw), one(tw) @assert !iszero(X[i,j]) if prl==0 for j_ = j+1:size(X,2) X[i,j_] == _0 && continue add_col!((Vi,X), (V,), -X[i,j_]*x, j, j_) X[i,j_] = _0 end elseif prl==1 Threads.@threads for j_ = j+1:size(X,2) X[i,j_] == _0 && continue add_col!((Vi,X), (V,), -X[i,j_]*x, j,j_) X[i,j_]=_0 end else error("prl=$prl is not supported")
end
return nothing
end

function decompose_diag_over_field!(D ::Matrix{tw}; U ::Matrix{tw} =Matrix{tw}(undef,0,0), V ::Matrix{tw} =Matrix{tw}(undef,0,0),
Ui ::Matrix{tw} =Matrix{tw}(undef,0,0), Vi ::Matrix{tw} =Matrix{tw}(undef,0,0), prl ::Int =1) ::Int  where tw
m,n = size(D)
_0, _1 = zero(tw), one(tw)
mn=min(m,n)
r = 0  # this will be the rank of X
@inbounds for k=1:mn  # transform X into a diagonal matrix
i,j = loc_max(D,k,k)  # i,j=location of pivot
iszero(D[i,j]) && break;
r = k
swap_colrow!((Vi,D), (V,), k, j)
swap_colrow!((U,), (D,Ui), k, i)  # move pivot to position k,k
elim_col_over_field!(D, U, Ui, k, k; prl=prl)
elim_row_over_field!(D, V, Vi, k, k; prl=prl)
end
return r
end

function loc_max(X ::Matrix{tw}, i0 ::Int =1, j0 ::Int =1) ::Tuple{Int,Int}  where {tw}
#= Return the location in X where the nonzero entry with the smallest value is. This is used to select the best pivot for Gaussian elimination. =#
m,n = size(X);
i_, j_, d_ = i0, j0, abs(X[i0,j0])
@inbounds for i=i0:m, j=j0:n
d = abs(X[i,j])
if (!iszero(X[i,j]) && (iszero(X[i_,j_]) || d_<d))
i_, j_ = i, j;
d_ = d;
end
end
return i_, j_
end


Notice that if I am interested only in some of U,V,Ui,Vi, then I can omit those that aren’t needed from the function call and Matrix{tw}(undef,0,0) is used as a dummy argument, so that the computation finishes faster.

Trying out my implementation:

function create_matrices(m, n, vals)
X = rand(vals,(m,n));
U = Matrix{eltype(X)}(I,m,m)
V = Matrix{eltype(X)}(I,n,n)
return X, copy(X), U, V, copy(U), copy(V)
end

julia> X,D,U,V,Ui,Vi = create_matrices(10^3, 10^3, -2:0.00001:2);
julia> @time decompose_diag_over_field!(D,U=U,V=V,Ui=Ui,Vi=Vi,prl=0);
12.390055 seconds (3 allocations: 112 bytes)
julia> maximum(abs.(X-U*D*V))
5.262457136723242e-14

julia> X,D,U,V,Ui,Vi = create_matrices(10^3, 10^3, -2:0.00001:2);
julia> @time decompose_diag_over_field!(D,U=U,V=V,Ui=Ui,Vi=Vi,prl=1);
6.160460 seconds (148.09 k allocations: 14.468 MiB)
julia> maximum(abs.(X-U*D*V))
73.91626375378561

julia> X,_,_,_,_,_   = create_matrices(10^3, 10^3, -2:0.00001:2);
julia> @time U,D,V=svd(X);  D=diagm(D);
0.354754 seconds (16 allocations: 45.899 MiB, 1.90% gc time)
julia> maximum(abs.(X-U*D*V'))
4.596323321948148e-14


Questions:
(1) There should be 0 allocations caused by my functions, yet there are 2M. Why?? How can I get rid of them?
Edit: Since I replaced Vector{Matrix{tw}} with Tuple{Vararg{Matrix{tw}}}, the unparallelized code has minimal allocations.

(2) Why is the parallelized version (prl=1) incorrect? All threads read from the same col/row, but write into different cols/rows, so there should be no data race.

(3) How is svd so incredibly fast and precise? I tried going through julia/stdlib/LinearAlgebra/src/svd.jl at master · JuliaLang/julia · GitHub but it is a bit overwhelming. I’d appreciate if anyone could give me an outline of how linear algebra operations can be made more efficient. Are there similar tricks (for matrix equivalency) when we compute over finite fields, like \mathbb{Z}_2?

Note 1: I wish to keep the above functions in-place/mutating, because in my case, U,V,Ui,Vi will not always start as identity matrices.
Note 2: Forgive my condensed style of writing. It is easier for me to comprehend code like that. Also, forgive me that docstrings are inside of functions and end is on the same line. Those are habits from Python that I prefer.

Your code is very hard to parse, but

[Vi,D], [V]


this will allocate

1 Like

I added more space and line-breaks, hope it is better now.

I thought [Vi,D] is just a vector of pointers to matrices, not copies of matrices. Should I make a view to not allocate?

What would be the best way to pass a collection of existing matrices, without allocations?

There’s LinearAlgebra.rank

That does not work for finite fields. Besides, I need U,V,Ui,Vi, not just rank.

1 Like

Could you get rid of all the extra semicolons, put end keywords on their own lines, and make long lines shorter? I expect you’d get more people looking at your code that way.

A few comments: This is really tough to read. Personally, I wouldn’t even try to look closely at it unless I copied it and reformatted it on my end. I may be more opinionated than some about avoiding long lines, but even for those who don’t mind them, putting so many unrelated things on a single line is probably going to make it harder for people to help.

From what I can see, it looks like you are deliberately selecting small pivots. This will be numerically unstable in floating point. You want to use the largest available pivot. The smallest pivot is the worst choice in this case.

This decomposition could be computed using LU factorization with complete pivoting after which you will have a decomposition

P_1 A P_2^T = L U = \begin{bmatrix} L_{11} & 0 \\ L_{12} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & 0 \end{bmatrix}

where the nonzero pivots form the diagonal of U_{11}. If you pull out those pivots into a diagonal matrix D_1 you have

A = P_1^T \begin{bmatrix} L_{11} & 0 \\ L_{12} & L_{22} \end{bmatrix} \begin{bmatrix} D_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} D_1^{-1}U_{11} & D_1^{-1}U_{12} \\ 0 & I \end{bmatrix} P_2.

This is a decomposition of the form you want. In floating point you won’t typically get exactly zero pivots so there will likely be no block U_{22}=0 and U_{11} will be n\times n. You might want to truncate everything when the pivots become too small. If you are working in a field where operations are exact, you could expect to get exact zeros in U_{22}. Personally I’d formulate the algorithm this way instead of trying to do both row and column operations. You could then compute inverses after the fact without much trouble, if you really need them.

In terms of efficiency, it typically comes down to formulating your algorithm to be cache friendly by using matrix-matrix multiplication with optimized level-3 BLAS operations. In theory that should be helpful even in other fields, but I don’t know what’s out there in terms of efficient BLAS implementations for things other than floating point. Still, I think you would see significant benefits from restructuring your code to work with blocks and use matrix-matrix multiplication even if your matrix multiplication is not highly optimized. It does more work with data you can keep in cache. This is typically not a small difference in performance.

Edit: This might be a high bar on readability on your end, but if you can read Fortran, here is a simple LU with complete pivoting. An optimized version using level-3 BLAS would be quite a bit more work and it looks like that isn’t in LAPACK.

1 Like

@nsajko I did all of that. Hopefully, now the code is more appealing.

@mstewart I reorganized the code. Hopefully it is good now.
Thank you for the remark about the pivot: taking the pivot with largest absolute value does indeed give better accuracy!!! I will change this in my original post.

I need to think a bit more about your comments about LU decomposition and matrix multiplication.

What about my questions (1) and (2)? Any idea where all the allocations are coming from? And why is the parallelized version wrong?

1 Like

For the allocations, note that creating vectors like [Vi,D] or just [V] almost always allocates. Can you replace them with tuples? (I tried running your code but got UndefVarError: loc_min not defined).

“Over a field” means non-floating point, right? (floating point is not a field – e.g. addition is not associative)

The only invariant needed should be the rank. Since you’re in a field, there is no consideration of numerical stability or the like.

You may need to consider stuff like denominator sizes if your field is Rational{BigInt}-style, i.e. infinite. But you should probably let go all hope of a generic algorithm that handles such subtleties (last I remember, computing rank of integer matrices over the rationals in poly time is not entirely trivial for this reason).

PS. If you deal with a rational function field, then consider polynomial identity testing (Schwartz-Zippel-Lemma) to deal with that blowup.

3 Likes

The edits help quite a bit.

You still have a reference to loc_min in there and it also looks like loc_max is still looking for small pivots. Also, I don’t think the function pivot_value is appropriate for stability. For stability you want the pivot of largest absolute value. Using a pivot that maximizes or minimizes abs(abs(w)-1) isn’t going to be stable in general. Changing pivot_value and loc_max to

function pivot_value(w ::tw) ::Float64  where {tw<:Union{Float16,Float32,Float64,ComplexF16,ComplexF32,ComplexF64}}
return abs(w)   # This is the distance from a real/complex entry w to ±1/S^1.
end

function loc_max(X ::Matrix{tw}, i0 ::Int =1, j0 ::Int =1) ::Tuple{Int,Int}  where {tw}
#= Return the location in X where the nonzero entry with the smallest value is. This is used to select the best pivot for Gaussian elimination. =#
m,n = size(X);
i_, j_, d = i0, j0, pivot_value(X[i0,j0])
@inbounds for i=i0:m, j=j0:n
if pivot_value(X[i,j])>d
i_, j_ = i, j;
d = pivot_value(X[i,j]);
end
end
return i_, j_
end


seems to stabilize things when I run it, as I would expect. Of course for a finite field, you could use any nonzero pivot, but you clearly have already planned for that with the pivot_value function. You could just return 1 for any nonzero value.

On the allocations, the only thing that stands out to me is what has already been pointed out. I’d put those matrices into a tuple instead of a vector and see where you stand on allocations after that.

In terms of using threads, most dense matrix code incorporates that into the matrix-matrix multiply. This is probably the last thing I’d worry about at this point. If it were me, I’d go with LU factorization, which I think you should be able to make faster than what you have. The column elimination steps really aren’t necessary. Then I’d think about blocking, which for floating point would exploit threaded BLAS using LinearAlgebra.mul!. You would probably get a factor of 10 speedup from that. Then I’d worry about matrix multiplication for finite fields, which might be a good place to think about using @threads.

1 Like

I edited the code in the opening post. As @sijo suggested, I replaced Vector{Matrix{tw}} with ˙Tuple{Vararg{Matrix{tw}}}˙ and in the unicore code, the allocations are now indeed minimal. This is not the case with the multithreaded code.

@foobar_lv2 Technically, floats do not constitute a field, but they are an approximation for the field of real/complex numbers, so I count them in.
My most important fields that I have in mind: Mod{p} \approx \mathbb{Z}_p, Rational{Int} \approx \mathbb{Q}, Float64 \approx \mathbb{R}, ComplexF64 \approx \mathbb{C}, \mathbb{Z}_{(p)} (localized ring, as a subfield of \mathbb{Q}), Rational{Polynomial{Mod{p},:t}} \approx \mathbb{Z}_p(t), … Perhaps the most important case for me at the moment is \mathbb{Z_2}.

Rank is certainly not the only invariant needed. In homological algebra, I don’t have just one matrix but a whole diagram of them, and I need to simultaneously do row/column operations on them.

@mstewart Before I start implementing the LU decomposition over a generic field, may I please clear up the outline of your proposed solution. Will it be possible to, given X,U,V,Ui,Vi, where X is rectangular, mutate X into a diagonal matrix, and correspondingly edit U,V,Ui,Vi?

How would this be achieved? I don’t understand your proposal. After I construct L and U, how do I get to the diagonal matrix?

My use-case is that I have a chain complex. I need to diagonalize the boundary matrices one by one, and mutate the accompanying homotopy equivalences…

This is nonsense. Floats approximate reals in the sense of small pertubations. Linear independence and rank are not stable under small pertubations (rank is lower semi-continuous, not continuous!). Arbitrarily small generic pertubations of a matrix will always lead to a full rank matrix.

Therefore rank or equivalence, in the linear algebra sense, are not concepts that are in any way applicable to floating point matrices.

Instead, you can haggle about spectral gaps:

help?> LinearAlgebra.rank
rank(A::AbstractMatrix; atol::Real=0, rtol::Real=atol>0 ? 0 : n*ϵ)
rank(A::AbstractMatrix, rtol::Real)

Compute the rank of a matrix by counting how many singular values of A have magnitude greater than max(atol, rtol*σ₁) where σ₁ is
A's largest singular value. atol and rtol are the absolute and relative tolerances, respectively. The default relative tolerance
is n*ϵ, where n is the size of the smallest dimension of A, and ϵ is the eps of the element type of A.


This is an entirely different conceptual world than ranks over fields! In order to think about this, you should study computational linear algebra and functional analysis.

Over finite fields, your job is simple: Do the standard rank computation you know from intro lectures. Nothing difficult.

Over fields of characteristic 0, especially Rational{BigInt} style, you need to be very careful during your computation, in order to avoid exponential (!) slowdowns. The reason is that multiplication/addition is not constant time anymore, but instead depends on the size (number of digits) of numerator/denominator. This is a pain, but you can find papers that deal with it. If you ask, me or somebody else here will point you to the relevant literature.

For function fields, and especially multivariate function fields like Rational{Polynomial{Mod{p}, :t, :s}} it is afaiu impossible to avoid the potential blowup that your entries have exponentially many nonzero coefficients. In some cases you can solve this via probabilistic methods: Plug in random assignments to your :t, :s and compute the rank over \mathbb{Z}_p. This gives you a lower bound on the actual rank. Under certain conditions, the Schwartz-Zippel Lemma (polynomial identity testing) gives you good confidence that this is the actual rank. If your field has low characteristic (e.g. \mathbb{Z}_2) then you need to do a field extension to some big enough GF(p,n) (otherwise there are not enough random choices available!). If you ask, me or somebody else here will point you to the relevant literature.

These are all rank / linear independence / matrix equivalency problems with completely different flavor that require completely different algorithms.

PS. After reconsideration, the right way of doing ranks in Rational{BigInt}-style fields in julia is almost surely to instead do the rank over \mathbb{Z}_p for some random 64 or 128 bit prime p. This is because BigInt in julia sucks, performance-wise. Unfortunately I cannot point you to the literature on that technique, not my field, sorry (pun not intended).

2 Likes

Oh, I didn’t see that, this is fun. Persistent homology perchance?

Generally I very strongly suggest using geometry / homotopy before doing any algebra. This is typically much faster: The fundamental problem is not often not that the matrices are hard to reduce, it’s that they are so freaking big that you don’t want to ever materialize them into memory. So a good approach uses homotopy (e.g. discrete morse) where possible, and only uses linear algebra on the remainder (e.g. the critical simplices when doing discrete morse).

But for more input, you’d need to comment more.

PS. Also look at Uli Bauer’s Ripser / oblivious rank computation. It’s far from optimal, but it is pretty good and gives a hint at the difference between “generic” rank computations and homological rank computations. (the algorithm would suck like hell for generic rank computations, but it performs remarkably well for the chain complexes that come up in real-life persistent homology)

1 Like

Sure, assuming I understand correctly that you want to use the transformations that diagonalize X to update other matrices. If you have computed the U, and V for a single decomposition, you can apply those matrices to update other matrices using matrix multiplication/inversion (where “inversion” should mean back/forward substitution since you will have permuted triangular matrices as your factors.) If you use optimized routines for that it will be faster than trying to incrementally do a bunch of individual column and row operations as updates on other matrices. Even in computing the original factorization, it’s preferable to minimize those sorts of operations.

A simple in-place decomposition that works for square or tall matrices could be something like:

function lu_complete!(A)
m, n = size(A)
p1 = collect(1:m)
p2 = collect(1:n)
for j in 1:n
imax, jmax = findmax(abs, view(A, j:m, j:n))[2].I
A[imax + j - 1, jmax + j - 1] == 0 &&  break
if jmax != j
swap_col!(view(A, :, j:n), 1, jmax)
p2[j], p2[jmax + j - 1] = p2[jmax + j - 1], p2[j]
end
if imax != j
swap_row!(view(A, j:n, 1:n), 1, imax)
p1[j], p1[imax + j - 1] = p1[imax + j - 1], p1[j]
end
for i in j+1:m
A[i, j] = A[i, j] / A[j, j]
end
for k in j+1:n
for i in j+1:m
A[i, k] = A[i, k] - A[i, j] * A[j, k]
end
end
for k in j+1:n
A[j, k] = A[j, k] / A[j, j]
end
end
return p1, A, p2
end

function getLDU(A)
m, n = size(A)
L = zeros(eltype(A), m, n)
U = zeros(eltype(A), n, n)
d = zeros(eltype(A), n)
for j in 1:n
L[j,j] = one(eltype(A))
d[j] = A[j,j]
U[j,j] = one(eltype(A))
for i in j+1:m
L[i, j] = A[i, j]
end
for i in 1:j-1
U[i, j] = A[i, j]
end
end
return L, Diagonal(d), U
end

A0 = randn(1100,1000)
A = copy(A0)
@time p1, A, p2 = lu_complete!(A)
@time L, D, U = getLDU(A)
display(opnorm(L*D*U - A0[p1, p2]))
display(opnorm(L[invperm(p1), :]*D*U[:, invperm(p2)] - A0))


Your factors that you are interested in are the permuted versions of L and U. I changed your swap routines to accept an AbstractMatrix so they would work with views. Julia doesn’t seem to allow for rectangular UnitLowerTriangular matrices, which is why getLDU pulls the factors out as ordinary matrices. This seemed to work for rational matrices as well, although you overflow pretty quickly if you don’t use BigInt. There’s still potentially a lot of gain to be made by doing it in blocked form as well. I also chose to form L as rectangular and D as square, so it’s a “compact” form of the decomposition. But it’s trivial to extend L with an identity block and D with a zero block to get L as a square invertible matrix.

This discussion is way over my head, but I just wanted to make sure that the OP is aware of the computer algebra system Nemo.jl and its functionality. From the documentation here on matrices : “Nemo allows the creation of dense matrices over any computable ring RR. There are two different kinds of implementation: a generic one for the case where no specific implementation exists (provided by AbstractAlgebra.jl), and efficient implementations of matrices over numerous specific rings…” where the specific rings are listed immediately following.

The documentation on Matrix functionality lists functions to perform elementary row and column operations (including in-place operations), and functions to compute matrix rank, determinant, trace, LU decomposition, and many others. Hope this helps.

4 Likes

Unfortunately I cannot point you to the literature on that technique, not my field, sorry (pun not intended).

Thank you for the useful comments on ranks. In my case, though, rank is just a byproduct of my function, the main point are matrices U,V, which give the basis of homology. Rank is used purely to know which columns of Vi constitute \mathrm{Ker} X and which columns of U give \mathrm{Coker} X.

this is fun. Persistent homology perchance?

It is fun : ). For now I’m dealing with just chain homology, but I’m building toward persistence. We’ll see how it goes.

So a good approach uses homotopy (e.g. discrete morse) where possible, and only uses linear algebra on the remainder (e.g. the critical simplices when doing discrete morse).

Precisely! And in my opening post, I deal with the case of when the sparse matrices were reduced to smaller size but much higher density, so it makes sense to convert them to Matrix{tw}.

PS. Also look at Uli Bauer’s Ripser / oblivious rank computation.

I’ve used Ripser in the past and quite liked it, though I haven’t yet delved into how it works. I know also of Flagser, Eirene, Gudhy, Dionysus, Perseus, LinBox, but have had only minimal exposure.

I’ll comment the rest tomorrow.

Sure, assuming I understand correctly that you want to use the transformations that diagonalize X to update other matrices.

That is the goal, yes.

@mstewart Running your code gives a BoundsError at swap_col!. You probably meant to use
swap_row!(view(A, j:m, :), 1, imax) instead of
swap_row!(view(A, j:n, 1:n), 1, imax)?

I have rewritten your function to accept tall as well as wide matrices. Forgive me for refactoring a little bit, I just found it easier to understand it this way:

function decompose_LDU!(X ::Matrix{tw}; vrb ::Int =1) ::Tuple{Vector{Int},Vector{Int}}  where {tw}
#= Given an mxn matrix X over a field, we change it in-place and return
permutations σ and τ, so that for L, D, U = X[i<j], X[i=j], X[i>j] = getLDU(X),
there holds L*D*U = X[σ,τ], or eqivalently, L[σ⁻¹,:]*D*U[:,τ⁻¹] = X. =#
m, n = size(X)
mn = min(m,n)
σ = collect(1:m)
τ = collect(1:n)
for k=1:mn
i, j = loc_max(X, k, k)
iszero(X[i,j]) && break
if i!=k
swap_row!(X,k,i)
σ[k], σ[i] = σ[i], σ[k]
end
if j!=k
swap_col!(X,k,j)
τ[k], τ[j] = τ[j], τ[k]
end
for i=k+1:m
X[i,k] /= X[k,k]
end
for i=k+1:m, j=k+1:n
X[i,j] -= X[i,k]*X[k,j]
end
for j=k+1:n
X[k,j] /= X[k,k]
end
end
return σ, τ
end
function getLDU(X ::Matrix{tw}) ::Tuple{Matrix{tw},Vector{tw},Matrix{tw}}  where {tw}
#= Given an mxn matrix X, we return matrices L,D,U of shape mxk, kxk, kxn, where k=min(m,n),
that consist of all entries from X that lie below/on/above the diagonal. =#
m,n = size(X)
k = min(m,n)
_1 = one(tw);
L = zeros(tw, m, k)
D = [X[i,i] for i=1:k]
U = zeros(tw, k, n)
for i=1:m, j=1:n
if i>j
L[i,j] = X[i,j]
end
if i==j
L[i,j] = _1
U[i,j] = _1
end
if i<j
U[i,j] = X[i,j]
end
end
return L, D, U
end


When benchmarking this on smallish matrices, we get impressive speedups:

julia> X0=randn(1005,1000);  X=copy(X0);
julia> @time σ,τ=decompose_LDU!(X,vrb=0);
2.565369 seconds (5 allocations: 16.031 KiB)
julia> @time L,D,U=getLDU(X);
0.016085 seconds (9 allocations: 15.305 MiB)
julia> opnorm(L*Diagonal(D)*U - X0[σ,τ])
3.2125044038923325e-13

julia> X0=randn(1000,1005);  X=copy(X0);
julia> @time σ,τ=decompose_LDU!(X,vrb=0);
2.159218 seconds (5 allocations: 16.031 KiB)
julia> @time L,D,U=getLDU(X);
0.009783 seconds (9 allocations: 15.305 MiB)
julia> opnorm(L*Diagonal(D)*U - X0[σ,τ])
3.310874370695303e-13


So we went from 14s to 2s. I will leave this running through the night for m=n=10^4 and compare with my parallelized decompose_diag_over_field!.

I am still very much interested in why my parallelization in decompose_diag_over_field! gives an incorrect result. If nothing else, I’d like to understand the pitfalls of multithreading.

There’s still potentially a lot of gain to be made by doing it in blocked form as well.

Could you please elaborate on that?

What would be the best way to optimize this diagonalization or LDU decomposition for matrices over \mathbb{Z}_2, like Matrix{Bool} or BitMatrix (with +=xor and *=&). I’m aiming at matrices of size 10^5\times10^5.

@PeterSimon Thank you for suggesting Nemo.jl. I know of it and AbstractAlgebra.jl (I also used Singular, GAP, Macaulay a few years ago), though I’m not up-to-date on all its functionalities. Partly due to being worried that it brings lots of overhead.

For example, here performance of + and *, suggested improvement · Issue #31 · scheinerman/Mods.jl · GitHub I constructed my own implementation of the ring \mathbb{Z}_m. When I benchmarked multiplying two matrices of size 10^4\times10^4, my implementation needed 3684s, Nemo’s needed 28602s.

For diagonalization/LU decomposition, do you think the solutions in Nemo.jl or AbstractAlgebra.jl are more performant than mstewart’s code above?

Unfortunately no julia bindings afaiu. But it would be a great community service if you could write bindings