How to improve efficiency of my linear algebra code (matrix equivalency)

I have no idea. Maybe worth a test?

I think Nemo or AbstractAlgebra might help, but this depends on what exactly is wanted (it is still unclear to me):

  • Looking at the code here, this is all very floating-point centric (making use of abs, comparing things with maximum(abs.(...)), or "finding the largest/smallest" pivot). This is the opposite of symbolic/exact computations.
  • Computable fields are talked about, but then things like the localizations \mathbb{Z}_{(p)} are mentioned, which are not fields.
  • Deciding matrix equivalence reduces to the computation of the Smith normal form, so I don't understand why this is not wanted?

I wrote this for the square case and then after writing a post I noticed you had mentioned rectangular matrices, so I made some last minute changes. I thought I had tested the version that ended up in the post, but it does look like I messed that up.

There is an old SIAM Review article that covers the main idea for LU factorization in section 4. There's a lot of useful context on other things like BLAS operations that you might find helpful if you haven't seen any of this before. There's also a very old looking web version.

In the notation of the paper, the speed largely comes from updating the trailing submatrix E all at once using E' = E - L_1 U_1, which involves matrix-matrix multiply. An optimized matrix-matrix multiply can be used which will ideally be threaded and will do further blocking of L_1 and U_1 to multiply smaller blocks (say b\times b) that fit in cache. Those smaller matrix multiplies do O(b^3) work but involve reading in only O(b^2) data. A row operation involving two rows might do O(n) work on O(n) data, so with matrix-matrix multiply you get much more of the computation done for the amount of data you have to read (slowly) from memory.

Arranging computations in this way to get most of the computation into an optimized matrix-matrix multiply (level 3 BLAS) is a huge win for performance. It's basically the answer to your original question about why the SVD is so fast. It's certainly possible to make an LU factorization even faster. And it's also why I suggested delaying the updates to your matrices U, V, etc. Those will also be much faster if they are done with matrix-matrix multiply (or a similarly optimized back/forward substitution if you need inverses).
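
To make that concrete, here is a minimal sketch of the right-looking blocked scheme (no pivoting, square matrix assumed to have nonzero pivots; the function name is mine, not from any package). The point is simply that everything outside the narrow panel is updated with a single matrix-matrix multiply:

using LinearAlgebra

# Minimal sketch of a right-looking blocked LU without pivoting.
# L is stored below the diagonal of A with an implicit unit diagonal, U on and above it.
function blocked_lu!(A::AbstractMatrix, b::Int = 64)
    n = size(A, 1)
    for k in 1:b:n
        kb = min(b, n - k + 1)
        cols = k:k+kb-1          # current panel columns
        rest = k+kb:n            # trailing block
        # Unblocked elimination of the panel A[k:n, cols].
        for j in cols, i in j+1:n
            A[i, j] /= A[j, j]
            for l in j+1:last(cols)
                A[i, l] -= A[i, j] * A[j, l]
            end
        end
        if !isempty(rest)
            # U_1 block: solve L_11 * U_1 = A[cols, rest].
            ldiv!(UnitLowerTriangular(view(A, cols, cols)), view(A, cols, rest))
            # Trailing update E <- E - L_1 * U_1: one level-3 matrix-matrix multiply.
            mul!(view(A, rest, rest), view(A, rest, cols), view(A, cols, rest), -1, 1)
        end
    end
    return A
end

An easy sanity check is to compare the result against lu(A, NoPivot()) on a random diagonally dominant test matrix.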

Note that the description of blocked LU factorization does not include pivoting (permutations) or factoring the pivots out of U. The latter is easy to do after the fact, although I incorporated it into the main loop in my version. Pivoting is fairly easy to incorporate if you use partial pivoting (use row permutations to get the largest element in absolute value in the current column in which you are doing elimination and use that as the pivot). Unfortunately, partial pivoting is not reliable at revealing rank, so the degree of rank deficiency in A will not necessarily match the number of small pivots you have in D. That's why I suggested complete pivoting.
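
For reference, the partial-pivoting row swap described in the parentheses above amounts to only a few lines; an illustrative sketch of one step:

# One partial-pivoting step at elimination column k: pick the row with the
# largest |entry| in column k among rows k:n and swap it into the pivot position.
function partial_pivot!(A::AbstractMatrix, perm::Vector{Int}, k::Int)
    n = size(A, 1)
    p = k - 1 + argmax(abs.(view(A, k:n, k)))
    if p != k
        A[[k, p], :] = A[[p, k], :]          # swap rows k and p
        perm[k], perm[p] = perm[p], perm[k]  # record the row permutation
    end
    return A
end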

However complete pivoting has a problem if you want to do the faster blocked version: You need to know the elements of the trailing submatrix to choose pivots, even though your goal is to defer updating that block until you have L_1 and U_1 and are ready to form E-L_1 U_1. You can form incremental updates to individual elements of E to look for candidate pivots, but since complete pivoting involves looking at all of E to find the largest pivot, this effectively requires you to update E the slow way without matrix-matrix multiply. That approach to incremental updating of individual elements of E to find a pivot is discussed in this LAPACK working note in the context of pivoted Cholesky. It works there because you are looking for pivots that are on the diagonal, so you don't have to look at that many updated elements of E to find a pivot.

For LU factorization you could use rook pivoting, which attempts to find a pivot that is the largest element in absolute value in both its column and row. It addresses the problems with partial pivoting in terms of determining numerical rank while limiting how many elements of an updated E you have to look at to find the pivot. There is a Matlab package for this that is based on Fortran code, but I haven't used it. I haven't seen anything like this in Julia. It would be somewhat nontrivial to implement this. I have done similar things before, but I expect even having some experience, I'd spend at least a full day getting something basic working. That might be overly optimistic depending on how much I got right on a first attempt and how much debugging needed to be done.
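
To illustrate just the search itself (on an already-updated trailing submatrix E, ignoring the deferred-update bookkeeping above), rook pivoting alternates between column and row scans until a candidate dominates both; a sketch:

# Rook pivot search: return (i, j) such that |E[i, j]| is the largest entry
# in absolute value of both row i and column j of E.
function rook_pivot(E::AbstractMatrix)
    j = 1
    i = argmax(abs.(view(E, :, j)))              # largest entry of the first column
    while true
        jnew = argmax(abs.(view(E, i, :)))       # best column in row i
        inew = argmax(abs.(view(E, :, jnew)))    # best row in that column
        (inew == i && jnew == j) && return (i, j)
        i, j = inew, jnew
    end
end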

Note that I am assuming here that for floating point you want to use this factorization to find numerical rank through looking at small pivots, i.e. that you are interested in knowing whether your matrix is close to a matrix with a given degree of rank deficiency. For true fields, you obviously are looking for exact rank deficiency. They are different things as has already been pointed out in this discussion.

Of course, in a finite field, you just need to find a nonzero pivot. Just doing incremental updates to E until you find a nonzero pivot would work and shouldn't hurt performance. But you need to have a good matrix-matrix multiply routine for your field to get the benefits of blocking. (for floating point you can use mul! in Julia for this.) I don't know if there's anything out there. We work in very different areas and I almost never look at anything except floating point. Implementing the matrix-matrix multiply and getting it threaded and working semi-efficiently is likely to be significantly more work than the LU factorization. And trying to get it as efficient as a good level 3 BLAS implementation would be extremely difficult. I very much doubt my ability to do that in a reasonable amount of time (or at all). Googling does turn up some code for finite field BLAS, but I don't know anything about them.
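
For what it's worth, a bare-bones element type is already enough to get LinearAlgebra's generic fallbacks running, just not at level-3 speed. A sketch (the Zp type here is illustrative, not from any package):

using LinearAlgebra

# Minimal sketch of a prime-field element type Z/p (p assumed prime and small).
struct Zp{p} <: Number
    val::Int
    Zp{p}(x::Integer) where {p} = new(mod(x, p))
end
Base.:+(a::Zp{p}, b::Zp{p}) where {p} = Zp{p}(a.val + b.val)
Base.:-(a::Zp{p}, b::Zp{p}) where {p} = Zp{p}(a.val - b.val)
Base.:*(a::Zp{p}, b::Zp{p}) where {p} = Zp{p}(a.val * b.val)
Base.zero(::Type{Zp{p}}) where {p} = Zp{p}(0)
Base.one(::Type{Zp{p}}) where {p} = Zp{p}(1)

# The generic matrix multiply in LinearAlgebra now works:
A = Zp{11}.(rand(0:10, 200, 200))
B = Zp{11}.(rand(0:10, 200, 200))
C = similar(A)
mul!(C, A, B)   # correct, but no cache blocking and no threading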

So if you want to do some work, you probably have at least a potential factor of 10 gain in efficiency for floating point, and similar principles should apply for finite fields, but that might be even more work. I think the rationals are an entirely different thing. As someone who mostly just does floating point, I've always assumed that doing nontrivial matrix factorizations with large matrices over the rationals is intractable. I think some of the other people responding to you here know more about that than I do. If there are specialized algorithms for working with rationals, I would not be the sort of person who would know about them.

After writing a long response, I did think of asking a question that I won't bury in my much longer response: Is there any particular reason you are focused on using a single elimination method for multiple fields? It seems to me like you are making your life more difficult with that. The only reason I can think of is that elimination works in a general field, and writing your own code allowed you to incorporate the incremental updates to U and V (and inverses), which you would likely not have with an off-the-shelf code. But given that it is likely to be more efficient to update your matrices using matrix multiplication after computing a factorization, it seems like you are free to pick whatever factorization algorithm is most appropriate in each field. So for floating point you could use the SVD or a column-pivoted QR factorization. And there might be efficient factorization routines for finite fields in the packages others have pointed out. It's not hard to draw me into a discussion on numerical linear algebra, and I sometimes jump in without thinking about the larger context, but on reflection I am wondering why you want to do it this way.

Unfortunately there are no Julia bindings, as far as I understand. But it would be a great community service if you could write bindings :wink:

I have never done any programming in C or C++, and I haven't created any bindings, so I'm afraid I won't be of much help with that. I'll use my own implementation of \mathbb{Z}_p for now.

I will leave this running through the night for m=n=10^4 and compare with my parallelized decompose_diag_over_field!.

Here are the timings:

julia> X0 = randn(10^4,10^4);   
julia> X = copy(X0);
julia> @time σ,τ = decompose_LDU!(X,vrb=1);  
7539.612054 seconds (5 allocations: 15.969 KiB)
julia> @time L,D,U = getLDU(X);
  2.536451 seconds (10 allocations: 1.490 GiB, 0.47% gc time)
julia> opnorm(L*Diagonal(D)*U - X0[σ,τ])
8.690001994988724e-12

julia> X = rand(-2:0.0001:2, (m,n)); 
julia> D, U, V, Ui, Vi = get_data(X);
julia> @time decompose_diag_over_field!(D, U=U, V=V, prl=1);
7836.116055 seconds (9 allocations: 15.267 MiB)
julia> maximum(abs.(X-U*D*V))
519.5279274091533

Both solutions needed roughly 130 min, though decompose_LDU! used only 1 thread, while my decompose_diag_over_field! used 6 threads but gave the wrong answer.
If I could fix the incorrect output, then on more CPU cores the latter solution would perhaps be faster. I'm not sure if the LDU above can be parallelized.

  • Looking at the code here, this is all very floating-point centric (making use of abs, comparing things with maximum(abs.(...)), or "finding the largest/smallest" pivot). This is the opposite of symbolic/exact computations.

If, inside decompose_diag_over_field!, we replace loc_max with loc_pivot, where the latter calls loc_max for floats, loc_first_nonzero for finite fields, and so on, then the algorithm diagonalizes a matrix over any field. I am benchmarking it for finite fields, but I didn't want to overload this thread with too much code.
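
The dispatch could look roughly like this (a sketch with illustrative signatures, not my exact code):

# Pivot-location strategy chosen by the element type of the matrix.
# For floats: position of the entry of largest absolute value in D[k:end, k:end].
function loc_max(D::AbstractMatrix, k::Int)
    I = argmax(abs.(view(D, k:size(D, 1), k:size(D, 2))))
    return I[1] + k - 1, I[2] + k - 1
end
# For exact fields: position of the first nonzero entry in D[k:end, k:end].
function loc_first_nonzero(D::AbstractMatrix, k::Int)
    for j in k:size(D, 2), i in k:size(D, 1)
        !iszero(D[i, j]) && return i, j
    end
    return nothing   # trailing block is zero: nothing left to eliminate
end
loc_pivot(D::AbstractMatrix{<:AbstractFloat}, k::Int) = loc_max(D, k)
loc_pivot(D::AbstractMatrix, k::Int) = loc_first_nonzero(D, k)   # exact element types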

  • Computable fields are talked about, but then things like the localizations \mathbb{Z}_{(p)} are mentioned, which are not fields.

My mistake, those localizations fall under PIDs, which are relevant for my problems, but not for this thread.

  • Deciding matrix equivalence reduces to the computation of the Smith normal form, so I don't understand why this is not wanted?

It is very much wanted, but computing SNF for matrices larger than 10Kx10K is a vastly harder problem (explosion of coefficients). So I'd be very happy already with efficient algorithms for fields.

@thofma Given a matrix over a finite field, how can I compute its LU decomposition with Nemo.jl, AbstractAlgebra.jl, LinearAlgebraX.jl, or any other existing library in Julia? (to see if it is any faster than mstewart's above) I've found this, but it's so damn hard to just create a simple matrix in Nemo. The docs are not helpful.

@mstewart Thank you very much for the additional info on LU speedups!!! This is certainly a much deeper topic than I imagined. What you described looks larger than the scope of this thread, so I don't want to burden you with that. I will need a bit more time to think things over.

As I understand it, in Julia the default lu for floats calls the BLAS/LAPACK implementation from Fortran; it does not have its own? (I can't yet find my way around there.)

Of course, in a finite field, you just need to find a nonzero pivot. Just doing incremental updates to E until you find a nonzero pivot would work and shouldn't hurt performance. But you need to have a good matrix-matrix multiply routine for your field to get the benefits of blocking. (for floating point you can use mul! in Julia for this.) I don't know if there's anything out there.

There is, with Tullio.jl and Octavian.jl. Here, I made a simple implementation of \mathbb{Z}_p. For multiplying two 10Kx10K matrices over \mathbb{Z}_{11}, I needed 3685s; with Tullio, I reduced that to 888s. That is still a long way from the 15s for floats, but it is progress. With Octavian, I haven't managed to make it work; it seems to be picky about the types of matrix entries.
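
For illustration, the kind of Tullio multiply I mean, sketched with residues stored as Ints and reduced mod p afterwards (sizes chosen just for demonstration; my actual implementation may differ):

using Tullio

# Residues mod p kept as plain Ints; sum the products, then reduce mod p.
# For p = 11 an Int64 accumulator cannot overflow even at n = 10^4
# (each term is at most 10*10, and there are at most 10^4 of them).
p = 11
n = 2_000
A = rand(0:p-1, n, n)
B = rand(0:p-1, n, n)
@tullio C[i, j] := A[i, k] * B[k, j]   # threaded, tiled integer matmul
C .%= p                                # reduce back into 0:p-1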

So if you want to do some work, you probably have at least a potential factor of 10 gain in efficiency for floating point, and similar principles should apply for finite fields, but that might be even more work.

More work for finite fields?? I thought it would be simpler. Especially over \mathbb{Z}_2. Perhaps I will ask this question for \mathbb{Z}_p in a separate thread.

Is there any particular reason you are focused on using a single elimination method for multiple fields?

No; in the end, I will probably have to do things separately for each field. I just didn't want this diagonalization to become a whole project of its own, since it is just a small part of my overall aim. I had hoped that keeping the elimination code general, with field-specific methods for choosing pivots, would already be enough to improve my initial attempt a lot.

My use-case is that I have a whole diagram to mutate (homological algebra). Code stays more readable (easier to debug) if every change of basis in a specific vector space is reflected in all linear maps going into or out of that space. But it can also be done with decompositions, as you suggested.
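
Concretely, with the convention that the columns of an invertible C hold the new basis vectors of a space V written in the old basis, maps into V pick up C^{-1} on the left and maps out of V pick up C on the right. A small illustrative sketch (names mine):

using LinearAlgebra

# Propagate a change of basis C of the space V to all maps touching V.
# Each element of maps_into is the matrix of some f : W -> V, and each element
# of maps_out is the matrix of some g : V -> X, all written in the old bases.
function change_basis!(C::AbstractMatrix, maps_into::Vector{<:AbstractMatrix},
                       maps_out::Vector{<:AbstractMatrix})
    Cinv = inv(C)
    for k in eachindex(maps_into)
        maps_into[k] = Cinv * maps_into[k]   # outputs of f now expressed in the new basis
    end
    for k in eachindex(maps_out)
        maps_out[k] = maps_out[k] * C        # inputs of g now given in the new basis
    end
    return maps_into, maps_out
end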

Also, I naively expected that my initial code was just missing some @inline, @simd, LoopVectorization.@turbo, Polyester.@batch, … which would already speed things up a lot. Now it seems that speeding it up requires more matrix-matrix multiplication, and adjusting my code for that (over finite fields) would be a massive undertaking.

In any case, I am very grateful for all your valuable info.

For anyone out there: I'd appreciate any hint on why the parallelization in my initial code gives a wrong result.

Thanks for the clarification.

The packages that were mentioned will not make you happy if you aim at matrices of size 10^5, but for the sake of completeness, here is how to compute the LU factorization in AbstractAlgebra/Nemo:

julia> using Nemo;

julia> lu(matrix(GF(2), 2, 2, [0, 1, 1, 0]))
(2, (1,2), [1 0; 0 1], [1 0; 0 1])

That's certainly what is done by default. LAPACK is the standard almost everywhere, with good reason. It's very efficient and developed by people with a very high level of expertise. New and improved algorithms are incorporated quite consistently whenever something new is available. Native Julia code would be nice, but it is a very big job to replicate the functionality and performance of LAPACK.
There is also GenericLinearAlgebra.jl, which I have used occasionally when I want to use DoubleFloats.jl. But I don't think it implements an LU factorization, as far as I can see. And the fallback BLAS operations are straightforward implementations without blocking. Presumably the idea is that faster implementations can be provided elsewhere as appropriate for whatever types you need. It does have a very readable blocked (and recursive) Cholesky factorization, which might be a nice thing to look at if you are interested in getting a faster LU code working. The LU would be more complicated with pivoting, but you might benefit from starting by looking at something simpler. Between looking at that and the SIAM Review paper, you should get some pretty decent guidance on how to get started if you want to do an implementation. And I certainly don't mind answering questions if you have any more.

Your goal of handling 10^5\times 10^5 matrices with finite fields seems reasonable with that sort of code if Tullio.jl effectively gives you satisfactory matrix multiplication. Your timings for Tullio.jl with \mathbb{Z}_{11} still sound painfully slow to me, but if you can live with that, there's not any reason to expect that LU factorization should be any slower than a full matrix-matrix multiplication of similar size. LU does fewer floating point operations, and benefits from the same optimizations. In floating point on my system the LAPACK LU is less than half the time of matrix-matrix multiplication. I would guess that the blocked LU with a finite field should behave similarly.
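
If you want to see the ratio on your own machine, something along these lines gives a rough comparison (sizes arbitrary; this uses BenchmarkTools.jl):

using LinearAlgebra, BenchmarkTools

n = 4_000
A = randn(n, n); B = randn(n, n); C = similar(A)
@btime lu!(M) setup = (M = copy($A)) evals = 1   # LAPACK LU, about (2/3)n^3 flops
@btime mul!($C, $A, $B)                          # matrix-matrix multiply, 2n^3 flops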

There have been a few times recently that I've thought it would be nice to have an efficient LU factorization with a stronger pivoting scheme than partial pivoting. I don't think there would be any problem with giving some flexibility in choice of pivoting scheme as you had set up in your original code. I might try to get something going with rook pivoting and level 3 BLAS at some point. But at the moment I have a lot of things higher up my to-do list.

I just meant that I didn't know if you would have to implement some BLAS replacement for finite fields, which would be a pain. It appears that Tullio.jl might have you covered for that. I didn't know that it did blocking and could compete with BLAS on matrix multiplication. I haven't really used it. Maybe I should start. It looks very nice.

@thofma Benchmarking Nemo.jl for LU:

julia> using Nemo;
julia> R, n = Nemo.GF(2), 10^4;
julia> S = MatrixSpace(R, n, n);
julia> @time X = S(rand(0:1,n,n));
 40.401850 seconds (200.00 M allocations: 5.215 GiB, 25.96% gc time)
julia> @time F = Nemo.lu(X);
  1.146130 seconds (1.95 k allocations: 253.450 MiB)

How is lu so fast? Creating X, though, is awfully slow. And what are the components of F? It is very hard to use the documentation of Nemo. Here, I cannot find lu. Here, I found lufact and fflu, but those are not available anymore?

Your timings for Tullio.jl with \mathbb{Z}_{11} still sound painfully slow to me, but if you can live with that, there's not any reason to expect that \mathrm{LU} factorization should be any slower than a full matrix-matrix multiplication of similar size.

Octavian.jl already implements matrix multiplication (in pure Julia), which for Float64s is as fast as BLAS, and for Int64s is 10x faster than BLAS. My secret hope is that one day, it might accept custom types of entries in matrices, provided that the type is subject to certain restrictions, like fixed size, no branches in + and *, …

I just did a little digging. It looks like this calls a C library called Flint. If you look at the documentation, the LU does a block-recursive factorization, which is another approach to using cache efficiently. It's newer than the approach described in the 1992 SIAM Review article. Here is a paper by one of the LAPACK authors, obviously with floating point in mind, but the general principle should be similar.
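
Very roughly, the block-recursive idea looks like this (a sketch without pivoting, for square matrices; the function name is mine, not Flint's):

using LinearAlgebra

# Sketch of block-recursive LU without pivoting (assumes no zero pivots arise).
# L is stored strictly below the diagonal with an implicit unit diagonal, U on and above it.
function recursive_lu!(A::AbstractMatrix)
    n = size(A, 1)
    n == 1 && return A
    k = div(n, 2)
    A11 = view(A, 1:k, 1:k);    A12 = view(A, 1:k, k+1:n)
    A21 = view(A, k+1:n, 1:k);  A22 = view(A, k+1:n, k+1:n)
    recursive_lu!(A11)                      # A11 = L11 * U11
    ldiv!(UnitLowerTriangular(A11), A12)    # A12 <- L11 \ A12   (this is U12)
    rdiv!(A21, UpperTriangular(A11))        # A21 <- A21 / U11   (this is L21)
    mul!(A22, A21, A12, -1, 1)              # Schur complement: A22 <- A22 - L21*U12
    recursive_lu!(A22)                      # recurse on the trailing block
    return A
end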

It does compute this using partial pivoting to get a decomposition PA = LU. This would be a row echelon form. That form of pivoting would be a deal-breaker for numerical rank determination with floating point, but for a finite field it's not a problem, and it would still find the rank as the number of nonzero rows in U. You should be able to use it for finite fields, although if you really want a factorization with a diagonal matrix in the middle and nonsingular matrices on either side, you'll need to fill in the zero rows of U to make it nonsingular and put corresponding zeros for those rows in D.

Me too. I wanted complex floating point types. I've used Octavian and it is very impressive. But I don't know about its future. It's based on LoopVectorization.jl, which will be deprecated with Julia 1.11 unless someone steps up to maintain it, and that seems like something that would be very hard for most people to do.

Yes, because the documentation you link to is from 2017. It is described here: Matrix functionality · AbstractAlgebra.jl. See also:

help?> lu(matrix(GF(2), 2, 2, [1, 2, 3, 4]))
[...]
lu(A::MatrixElem{T}, P = SymmetricGroup(nrows(A))) where {T <: FieldElement}

  Return a tuple r, p, L, U consisting of the rank of A, a permutation p of A belonging to P, a lower triangular matrix L and an upper triangular matrix U such that p(A) = LU, where p(A) stands for the matrix whose rows are the given permutation p of the rows of A.

For this size, it uses a recursive approach, which reduces the problem to matrix multiplication. This is pretty standard in computer algebra (for fields with arithmetic in constant time). The multiplication itself is probably Strassen and/or delegated to BLAS.

The answer to my original question (2) can be found here: https://discourse.julialang.org/t/parallel-matrix-col-row-operations-give-incorrect-results/

There are also benchmarks for lu over finite fields.

gives an example of extending Octavian for another number type.

Octavian's cache blocking approach is really bad.

There will likely be something better and more generic in the future.

That's because the LinearAlgebra stdlib already implements a generic LU factorization.

@Elrod I'm delighted to hear you are working on matrix multiply over entries other than floats!!! Tropical numbers do not cover the case of GF(2), so I'm really hopeful for eventual \mathbb{Z}/p support. Do you think it is possible to surpass the performance of flint and fflas-ffpack?

Octavian's cache blocking approach is really bad.

Really?? Octavian achieved timings roughly the same as BLAS, the gold standard in industry. Are you saying that with a different approach, it could be possible to improve the timings even further?

One more question: are there any plans for a Julia lu! with performance similar to Octavian's, but over \mathbb{Z}/p entries? Or at least over \mathbb{Z}/2?

@stevengj Thanks for the link to generic_lufact!; I was not aware of its existence. However, that will not run over GF(p), since it uses abs and isless.
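
For reference, a pivot rule that only needs iszero is enough for exact fields; here is a rough unblocked sketch (names mine, not tuned, and for floats you would of course keep magnitude-based pivoting):

# Row-echelon elimination ("LU with first-nonzero pivoting") using only field
# operations (+, -, *, inv) and iszero, so it can run over GF(p) element types.
# Multipliers are stored in the zeroed positions; rank = length of `pivots`.
function generic_field_lu!(A::AbstractMatrix)
    m, n = size(A)
    perm = collect(1:m)
    pivots = Int[]
    r = 0                                      # number of pivots found so far
    for j in 1:n
        r == m && break
        p = findfirst(!iszero, view(A, r+1:m, j))
        p === nothing && continue              # no pivot in this column
        p += r                                 # absolute row index
        r += 1
        push!(pivots, j)
        if p != r
            A[[r, p], :] = A[[p, r], :]        # row swap
            perm[r], perm[p] = perm[p], perm[r]
        end
        piv = inv(A[r, j])                     # over GF(p): the modular inverse
        for i in r+1:m
            mult = A[i, j] * piv
            A[i, j] = mult                     # store the multiplier in place
            for l in j+1:n
                A[i, l] -= mult * A[r, l]
            end
        end
    end
    return perm, pivots
end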