I have no idea. Maybe worth a test?
I think Nemo or AbstractAlgebra might help, but this depends on what exactly is wanted (it is still unclear to me):
- Looking at the code here, this is all very floating point centric (making use of abs or comparing things with maximum(abs.(...)) or "finding the largest/smallest" pivot). This is the opposite of symbolic/exact computations.
- Computable fields are talked about, but then things like localizations \mathbb{Z}_{(p)} are mentioned, which are not fields.
- Deciding matrix equivalence reduces to the computation of the Smith normal form, so I don't understand why this is not wanted?
I wrote this for the square case and then after writing a post I noticed you had mentioned rectangular matrices, so I made some last minute changes. I thought I had tested the version that ended up in the post, but it does look like I messed that up.
There is an old SIAM Review article that covers the main idea for LU factorization in section 4. There's a lot of useful context on other things like BLAS operations that you might find helpful if you haven't seen any of this before. There's also a very old-looking web version. In the notation of the paper, the speed largely comes from updating the trailing submatrix E all at once using E' = E - L_1 U_1, which involves a matrix-matrix multiply. An optimized matrix-matrix multiply can be used, which will ideally be threaded and will do further blocking of L_1 and U_1 to multiply smaller blocks (say b\times b) that fit in cache. Those smaller matrix multiplies do O(b^3) work but involve reading in only O(b^2) data. A row operation involving two rows might do O(n) work on O(n) data, so with matrix-matrix multiply you get much more of the computation done for the amount of data you have to read (slowly) from memory. Arranging computations in this way to get most of the computation into an optimized matrix-matrix multiply (level 3 BLAS) is a huge win for performance. It's basically the answer to your original question about why the SVD is so fast. It's certainly possible to make an LU factorization even faster. And it's also why I suggested delaying the updates to your matrices U, V, etc. Those will also be much faster if they are done with matrix-matrix multiply (or a similarly optimized back/forward substitution if you need inverses).
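For concreteness, here is a minimal sketch of that trailing-submatrix update written as one in-place multiply (the helper name, the in-place storage convention, and the assumption that the b-by-b panel at (k, k) is already factored are all mine, not code from this thread):

using LinearAlgebra

# Hypothetical helper: after the b-by-b panel starting at (k, k) has been
# factored in place in A, update the trailing submatrix E <- E - L1*U1
# with a single level-3 multiply.
function blocked_update!(A::AbstractMatrix, k::Integer, b::Integer)
    n  = size(A, 1)
    L1 = @view A[k+b:n, k:k+b-1]    # block column of L below the panel
    U1 = @view A[k:k+b-1, k+b:n]    # block row of U to the right of the panel
    E  = @view A[k+b:n, k+b:n]      # trailing submatrix
    mul!(E, L1, U1, -one(eltype(A)), one(eltype(A)))   # E = E - L1*U1 (one gemm call)
    return A
end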
Note that the description of blocked LU factorization does not include pivoting (permutations) or factoring the pivots out of U. The latter is easy to do after the fact, although I incorporated it into the main loop in my version. Pivoting is fairly easy to incorporate if you use partial pivoting (use row permutations to get the largest element in absolute value in the current column in which you are doing elimination and use that as the pivot). Unfortunately, partial pivoting is not reliable at revealing rank, so the degree of rank deficiency in A will not necessarily match the number of small pivots you have in D. That's why I suggested complete pivoting.
However, complete pivoting has a problem if you want to do the faster blocked version: you need to know the elements of the trailing submatrix to choose pivots, even though your goal is to defer updating that block until you have L_1 and U_1 and are ready to form E-L_1 U_1. You can form incremental updates to individual elements of E to look for candidate pivots, but since complete pivoting involves looking at all of E to find the largest pivot, this effectively requires you to update E the slow way without matrix-matrix multiply. That approach to incremental updating of individual elements of E to find a pivot is discussed in this LAPACK working note in the context of pivoted Cholesky. It works there because you are looking for pivots that are on the diagonal, so you don't have to look at that many updated elements of E to find a pivot. For LU factorization you could use rook pivoting, which attempts to find a pivot that is the largest element in absolute value in both its column and row. It addresses the problems with partial pivoting in terms of determining numerical rank while limiting how many elements of an updated E you have to look at to find the pivot. There is a Matlab package for this that is based on Fortran code, but I haven't used it. I haven't seen anything like this in Julia. It would be somewhat nontrivial to implement this. I have done similar things before, but I expect that even with some experience, I'd spend at least a full day getting something basic working. That might be overly optimistic depending on how much I got right on a first attempt and how much debugging needed to be done.
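To illustrate just the alternating search (a sketch under the assumption that the trailing submatrix E is already fully updated and dense; a blocked implementation would instead update only the rows and columns it actually searches, and the function name is mine):

# Rook pivot search: alternate between column and row searches until an entry
# is at least as large in absolute value as everything in its row and column.
function rook_pivot(E::AbstractMatrix)
    j = 1
    i = argmax(abs.(view(E, :, j)))            # best entry in the first column
    best = abs(E[i, j])
    while true
        jnew = argmax(abs.(view(E, i, :)))     # best entry in row i
        abs(E[i, jnew]) > best || return (i, j)  # nothing strictly larger in the row
        j, best = jnew, abs(E[i, jnew])
        inew = argmax(abs.(view(E, :, j)))     # best entry in column j
        abs(E[inew, j]) > best || return (i, j)  # nothing strictly larger in the column
        i, best = inew, abs(E[inew, j])
    end
end

Since the tracked magnitude strictly increases on every step that does not return, the search terminates after visiting only a few rows and columns in typical cases.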
Note that I am assuming here that for floating point you want to use this factorization to find numerical rank through looking at small pivots, i.e. that you are interested in knowing whether your matrix is close to a matrix with a given degree of rank deficiency. For true fields, you obviously are looking for exact rank deficiency. They are different things as has already been pointed out in this discussion.
Of course, in a finite field, you just need to find a nonzero pivot. Just doing incremental updates to E until you find a nonzero pivot would work and shouldn't hurt performance. But you need to have a good matrix-matrix multiply routine for your field to get the benefits of blocking (for floating point you can use mul! in Julia for this). I don't know if there's anything out there. We work in very different areas and I almost never look at anything except floating point. Implementing the matrix-matrix multiply and getting it threaded and working semi-efficiently is likely to be significantly more work than the LU factorization. And trying to get it as efficient as a good level 3 BLAS implementation would be extremely difficult. I very much doubt my ability to do that in a reasonable amount of time (or at all). Googling does turn up some code for finite field BLAS, but I don't know anything about them.
So if you want to do some work, you probably have at least a potential factor of 10 gain in efficiency for floating point and similar principles should apply for finite fields, but that might be even more work. I think the rationals are an entirely different thing. As someone who mostly just does floating point I've always assumed doing nontrivial matrix factorization with large matrices over the rationals is intractable. I think some of the other people responding to you here know more about that than I do. If there are specialized algorithms for working with rationals, I would not be the sort of person who would know about them.
After writing a long response, I did think of asking a question that I won't bury in my much longer response: Is there any particular reason you are focused on using a single elimination method for multiple fields? It seems to me like you are making your life more difficult with that, and the only reason I can think of is that elimination works in a general field and writing your own code allowed you to incorporate the incremental updates to U and V (and inverses), which you would likely not have with an off-the-shelf code. But given that it is likely to be more efficient to update your matrices using matrix multiplication after computing a factorization, it seems like you are free to pick whatever factorization algorithm is most appropriate in each field. So for floating point you could use SVD or column-pivoted QR factorization. And there might be efficient factorization routines for finite fields in the packages others have pointed out. It's not hard to draw me into a discussion on numerical linear algebra, and I sometimes jump in without thinking about the larger context, but on reflection I am wondering why you want to do it this way.
Unfortunately there are no Julia bindings, as far as I understand. But it would be a great community service if you could write some.
I never did any programming in C or C++, and I haven't created any bindings, so I'm afraid I won't be of much help with that. I'll use my own implementation of \mathbb{Z}_p for now.
I will leave this running through the night for m=n=10^4 and compare with my parallelized decompose_diag_over_field!.
Here are the timings:
julia> X0 = randn(10^4,10^4);
julia> X = copy(X0);
julia> @time Ļ,Ļ = decompose_LDU!(X,vrb=1);
7539.612054 seconds (5 allocations: 15.969 KiB)
julia> @time L,D,U = getLDU(X);
2.536451 seconds (10 allocations: 1.490 GiB, 0.47% gc time)
julia> opnorm(L*Diagonal(D)*U - X0[Ļ,Ļ])
8.690001994988724e-12
julia> X = rand(-2:0.0001:2, (m,n));
julia> D, U, V, Ui, Vi = get_data(X);
julia> @time decompose_diag_over_field!(D, U=U, V=V, prl=1);
7836.116055 seconds (9 allocations: 15.267 MiB)
julia> maximum(abs.(X-U*D*V))
519.5279274091533
Both solutions needed roughly 130 min, though decompose_LDU! used only 1 thread, while my decompose_diag_over_field! used 6 threads but gave the wrong answer.
If I could fix the incorrect output, then on more CPU cores, the latter solution would perhaps be faster. I'm not sure if the LDU above can be parallelized.
- Looking at the code here, this is all very floating point centric (making use of abs or comparing things with maximum(abs.(...)) or "finding the largest/smallest" pivot). This is the opposite of symbolic/exact computations.

If inside decompose_diag_over_field! we replace loc_max with loc_pivot, where the latter calls loc_max for floats, loc_first_nonzero for finite fields, ..., then the algorithm diagonalizes a matrix over any field. I am benchmarking it for finite fields, but I didn't want to overload this thread with too much code.
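For concreteness, here is a minimal, hypothetical sketch of that dispatch (the signatures and the loc_max shown here are my assumptions, not the thread's actual code): loc_pivot selects the pivot in the trailing block A[k:end, k:end] by magnitude for floats and as the first nonzero entry for exact fields.

# Location of the largest-magnitude entry in the trailing block (floats).
loc_max(A::AbstractMatrix, k::Int) =
    Tuple(argmax(abs.(@view A[k:end, k:end]))) .+ (k - 1)

# First nonzero entry in the trailing block (exact fields).
function loc_first_nonzero(A::AbstractMatrix, k::Int)
    for j in k:size(A, 2), i in k:size(A, 1)
        iszero(A[i, j]) || return (i, j)
    end
    return nothing      # trailing block is entirely zero: nothing left to eliminate
end

loc_pivot(A::AbstractMatrix{<:AbstractFloat}, k::Int) = loc_max(A, k)
loc_pivot(A::AbstractMatrix, k::Int) = loc_first_nonzero(A, k)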
- Computable fields are talked about, but then things like localizations \mathbb{Z}_{(p)} are mentioned, which are not fields.
My mistake, those localizations fall under PIDs, which are relevant for my problems, but not for this thread.
- Deciding matrix equivalence reduces to the computation of the Smith normal form, so I don't understand why this is not wanted?
It is very much wanted, but computing the SNF for matrices larger than 10Kx10K is a vastly harder problem (explosion of coefficients). So I'd already be very happy with efficient algorithms for fields.
@thofma Given a matrix over a finite field, how can I compute its LU decomposition with Nemo.jl, AbstractAlgebra.jl, LinearAlgebraX.jl, or any other existing library in Julia? (to see if it is any faster than mstewart's above) I've found this, but it's so damn hard to just create a simple matrix in Nemo. The docs are not helpful.
@mstewart Thank you very much for the additional info on LU speedups!!! This is certainly a much deeper topic than I imagined. What you described looks larger than the scope of this thread, so I don't want to burden you with that. I will need a bit more time to think things over.
As I understand it, in Julia the default lu for floats calls the BLAS/LAPACK implementation from Fortran; it does not have its own? (I can't yet find my way around there.)
Of course, in a finite field, you just need to find a nonzero pivot. Just doing incremental updates to E until you find a nonzero pivot would work and shouldn't hurt performance. But you need to have a good matrix-matrix multiply routine for your field to get the benefits of blocking (for floating point you can use mul! in Julia for this). I don't know if there's anything out there.
There is, with Tullio.jl and Octavian.jl. Here, I made a simple implementation of \mathbb{Z}_p. For multiplying two 10Kx10K matrices over \mathbb{Z}_{11}, I needed 3685s. With Tullio, I reduced that to 888s. Still a long way from the 15s for floats, but it is progress. With Octavian, I haven't managed to make it work; it seems to be picky about the types of matrix entries.
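For anyone curious, here is a minimal sketch of one way to write such a multiplication with Tullio (an assumption about the approach, not the implementation linked above): accumulate ordinary integer products and reduce mod p once at the end, which is safe as long as n(p-1)^2 fits in an Int64.

using Tullio

# Hypothetical sketch: multiply matrices of residues mod p by accumulating in
# Int64 and reducing once at the end (valid while n*(p-1)^2 < typemax(Int64)).
function matmul_modp(A::Matrix{Int}, B::Matrix{Int}, p::Int)
    @tullio C[i, j] := A[i, k] * B[k, j]   # plain integer matrix multiply
    return mod.(C, p)                      # reduce every entry mod p
end

# Example over Z_11:
# A = rand(0:10, 10^4, 10^4); B = rand(0:10, 10^4, 10^4); C = matmul_modp(A, B, 11)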
So if you want to do some work, you probably have at least a potential factor of 10 gain in efficiency for floating point and similar principles should apply for finite fields, but that might be even more work.
More work for finite fields?? I thought it would be simpler. Especially over \mathbb{Z}_2. Perhaps I will ask this question for \mathbb{Z}_p in a separate thread.
Is there any particular reason you are focused on using a single elimination method for multiple fields?
No, in the end, I will probably have to do things separately for each field. I just didn't want this diagonalization to become a whole project on its own, since it is just a small part of my global aim. I had hoped that with the same general elimination routine, but field-specific methods for choosing pivots, a good approach could already improve my initial attempt a lot.
My use-case is that I have a whole diagram to mutate (homological algebra). Code stays more readable (easier to debug) if every change of basis in a specific vector space is reflected in all linear maps going into or out of that space. But it can also be done with decompositions, as you suggested.
Also, I naively expected that my initial code was just missing some @inline, @simd, LoopVectorization.@turbo, Polyester.@batch, ... which would already speed things up a lot. Now it seems that speeding it up requires more matrix-matrix multiplication, and adjusting my code for that (over finite fields) would be a massive undertaking.
In any case, I am very grateful for all your valuable info.
For anyone out there: I'd appreciate any hint on why the parallelization in my initial code gives a wrong result.
Thanks for the clarification.
The packages that were mentioned will not make you happy if you aim at matrices of size 10^5, but for the sake of completeness, here is how to compute the LU factorization in AbstractAlgebra/Nemo:
julia> using Nemo;
julia> lu(matrix(GF(2), 2, 2, [0, 1, 1, 0]))
(2, (1,2), [1 0; 0 1], [1 0; 0 1])
That's certainly what is done by default. LAPACK is the standard almost everywhere, with good reason. It's very efficient and developed by people with a very high level of expertise. New and improved algorithms are incorporated quite consistently whenever something new is available. Native Julia code would be nice, but it is a very big job to replicate the functionality and performance of LAPACK.
There is also GenericLinearAlgebra.jl, which I have used occasionally when I want to use DoubleFloats.jl. But I don't think it implements an LU factorization as far as I can see. And the fallback BLAS operations are straightforward implementations without blocking. Presumably the idea is that faster implementations can be provided elsewhere as appropriate for whatever types you need. It does have a very readable blocked (and recursive) Cholesky factorization, which might be a nice thing to look at if you are interested in getting a faster LU code working. The LU would be more complicated with pivoting, but you might benefit from starting by looking at something simpler. Between looking at that and the SIAM Review paper, you should get some pretty decent guidance on how to get started if you want to do an implementation. And I certainly don't mind answering questions if you have any more.
Your goal of handling 10^5\times 10^5 matrices over finite fields seems reasonable with that sort of code if Tullio.jl effectively gives you satisfactory matrix multiplication. Your timings for Tullio.jl with \mathbb{Z}_{11} still sound painfully slow to me, but if you can live with that, there's no reason to expect that LU factorization should be any slower than a full matrix-matrix multiplication of similar size. LU does fewer floating point operations and benefits from the same optimizations. In floating point on my system the LAPACK LU takes less than half the time of a matrix-matrix multiplication. I would guess that a blocked LU over a finite field should behave similarly.
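If anyone wants to make that comparison on their own machine, a rough way to do it (the size and the resulting numbers are illustrative, not the timings quoted above):

using LinearAlgebra

n = 4_000
A = randn(n, n); B = randn(n, n); C = similar(A)
lu(A); mul!(C, A, B);        # warm-up so compilation is not timed
@time lu(A);                 # LAPACK LU (getrf), about (2/3)n^3 flops
@time mul!(C, A, B);         # gemm, 2n^3 flops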
There have been a few times recently that I've thought it would be nice to have an efficient LU factorization with a stronger pivoting scheme than partial pivoting. I don't think there would be any problem with giving some flexibility in choice of pivoting scheme as you had set up in your original code. I might try to get something going with rook pivoting and level 3 BLAS at some point. But at the moment I have a lot of things higher up my to-do list.
I just meant that I didn't know if you would have to implement some BLAS replacement for finite fields, which would be a pain. It appears that Tullio.jl might have you covered for that. I didn't know that it did blocking and could compete with BLAS on matrix multiplication. I haven't really used it. Maybe I should start. It looks very nice.
@thofma Benchmarking Nemo.jl for LU:
julia> using Nemo;
julia> R, n = Nemo.GF(2), 10^4;
julia> S = MatrixSpace(R, n, n);
julia> @time X = S(rand(0:1,n,n));
40.401850 seconds (200.00 M allocations: 5.215 GiB, 25.96% gc time)
julia> @time F = Nemo.lu(X);
1.146130 seconds (1.95 k allocations: 253.450 MiB)
How is lu so fast? Creating X, though, is awfully slow. And what are the components of F? It is very hard to use the documentation of Nemo. Here, I cannot find lu. Here, I found lufact and fflu, but those are not available anymore?
Your timings for Tullio.jl with \mathbb{Z}_{11} still sound painfully slow to me, but if you can live with that, there's no reason to expect that LU factorization should be any slower than a full matrix-matrix multiplication of similar size.
Octavian.jl already implements matrix multiplication (in pure Julia), which for Float64s is as fast as BLAS, and for Int64s is 10x faster than BLAS. My secret hope is that one day it might accept custom types of matrix entries, provided that the type is subject to certain restrictions, like fixed size, no branches in + and *, ...
I just did a little digging. It looks like this calls a C library called Flint. If you look at the documentation, the LU does a block-recursive factorization, which is another approach to using cache efficiently. It's newer than the approach described in the 1992 SIAM Review article. Here is a paper by one of the LAPACK authors, obviously with floating point in mind, but the general principle should be similar.
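To give a feel for the recursive idea (my own unpivoted floating-point sketch, not Flint's algorithm; real code needs pivoting and a cache-tuned base case):

using LinearAlgebra

# Unpivoted block-recursive LU stored in place: split A into 2x2 blocks,
# factor A11, solve for the off-diagonal blocks with triangular solves,
# update the Schur complement with one gemm, then recurse.
function reclu!(A::AbstractMatrix)
    n = size(A, 1)
    if n <= 64                                   # base case: simple unblocked elimination
        for k in 1:n-1
            A[k+1:n, k] ./= A[k, k]
            A[k+1:n, k+1:n] .-= A[k+1:n, k] * transpose(A[k, k+1:n])
        end
        return A
    end
    h = n ÷ 2
    A11 = @view A[1:h, 1:h];   A12 = @view A[1:h, h+1:n]
    A21 = @view A[h+1:n, 1:h]; A22 = @view A[h+1:n, h+1:n]
    reclu!(A11)                                  # A11 = L11*U11
    ldiv!(UnitLowerTriangular(A11), A12)         # A12 <- inv(L11)*A12 = U12
    rdiv!(A21, UpperTriangular(A11))             # A21 <- A21*inv(U11) = L21
    mul!(A22, A21, A12, -one(eltype(A)), one(eltype(A)))   # A22 <- A22 - L21*U12
    reclu!(A22)                                  # factor the Schur complement
    return A
end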
The Flint routine does compute this using partial pivoting to get a decomposition PA = LU. This would be a row echelon form. That form of pivoting would be a deal-breaker for numerical rank determination with floating point, but for a finite field it's not a problem, and it would still find the rank as the number of nonzero rows in U. You should be able to use it for finite fields, although if you really want a factorization with a diagonal in the middle and nonsingular matrices on either side, you'll need to fill in the zero rows of U to make it nonsingular and put corresponding zeros for those rows in D.
Me too. I wanted complex floating point types. I've used Octavian and it is very impressive. But I don't know about its future. It's based on LoopVectorization.jl, which will be deprecated with Julia 1.11 unless someone steps up to maintain it, which seems like something that would be very hard for most people to do.
Yes, because the documentation you link to is from 2017. It is described here: Matrix functionality · AbstractAlgebra.jl; see
help?> lu(matrix(GF(2), 2, 2, [1, 2, 3, 4]))
[...]
lu(A::MatrixElem{T}, P = SymmetricGroup(nrows(A))) where {T <: FieldElement}
Return a tuple r, p, L, U consisting of the rank of A, a permutation p of A belonging to P, a lower triangular matrix L and an upper triangular matrix U such that p(A) = LU, where p(A) stands for the matrix whose rows are the given permutation p of the rows of A.
For this size, it uses a recursive approach, which reduces the problem to matrix multiplication. This is pretty standard in computer algebra (for fields with arithmetic in constant time). The multiplication itself is probably Strassen and/or delegated to BLAS.
The answer to my original question (2) can be found here: https://discourse.julialang.org/t/parallel-matrix-col-row-operations-give-incorrect-results/
There are also benchmarks for lu over finite fields.
This gives an example of extending Octavian for another number type.
Octavian's cache blocking approach is really bad.
There will likely be something better and more generic in the future.
That's because the LinearAlgebra stdlib already implements a generic LU factorization.
@Elrod I'm delighted to hear you are working on matrix multiply over entries other than floats!!! Tropical numbers do not cover the case of GF(2), so I'm really hopeful for eventual \mathbb{Z}/p support. Do you think it is possible to surpass the performance of flint and fflas-ffpack?
Octavian's cache blocking approach is really bad.
Really?? Octavian achieved timings roughly the same as BLAS, the gold standard in industry. Are you saying that with a different approach, it could be possible to improve the timings even further?
One more question: are there any plans for a Julia lu! with performance similar to Octavian's, but over \mathbb{Z}/p entries? Or at least over \mathbb{Z}/2?
@stevengj Thanks for the link to generic_lufact!; I was not aware of its existence. However, that will not run over GF(p), since it uses abs and isless.
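For what it is worth, here is an untested sketch (the type, the fixed p = 11, and its suitability for generic_lufact! are all my assumptions) of giving a custom \mathbb{Z}_p element type just enough arithmetic plus abs and isless so that a generic largest-|pivot| search simply picks some nonzero pivot:

# Hypothetical, untested sketch of a Z_p element type (p fixed at 11) with the
# arithmetic a generic LU needs; abs/isless treat every nonzero entry the same,
# so "largest pivot" degenerates to "any nonzero pivot".
struct Zp <: Number
    val::Int
    Zp(x::Integer) = new(mod(x, 11))
end
Base.:+(a::Zp, b::Zp) = Zp(a.val + b.val)
Base.:-(a::Zp, b::Zp) = Zp(a.val - b.val)
Base.:-(a::Zp)        = Zp(-a.val)
Base.:*(a::Zp, b::Zp) = Zp(a.val * b.val)
Base.inv(a::Zp)       = Zp(invmod(a.val, 11))
Base.:/(a::Zp, b::Zp) = a * inv(b)
Base.zero(::Type{Zp}) = Zp(0)
Base.one(::Type{Zp})  = Zp(1)
Base.iszero(a::Zp)    = a.val == 0
# "Magnitude" over a finite field: 0 for zero, 1 for everything else.
Base.abs(a::Zp)           = iszero(a) ? 0 : 1
Base.isless(a::Zp, b::Zp) = abs(a) < abs(b)

# Example matrix over Z_11:
# A = Zp.(rand(0:10, 5, 5))

I have not verified that generic_lufact! itself runs with this; it is only meant to show that the abs/isless obstacle can in principle be worked around for a custom field type.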