Fast computation of the Gramian of a matrix

For some model identification problems, I have to compute the Gramian of data matrices, i.e. compute A = X'*X, where X is a complex rectangular matrix.

Is there a fast implementation of this specific operation in Julia, given that the output matrix is Hermitian?

A first rough assessment suggests that exploiting this structure could halve both the computation time and the memory requirements. It could also reduce the numerical error propagation that otherwise leads to an output that is not exactly Hermitian.
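For concreteness, the low-level BLAS wrapper seems to expose exactly this operation for complex matrices; a minimal sketch (sizes just for illustration):

using LinearAlgebra

X = randn(ComplexF64, 1000, 50)

# herk computes only the requested triangle of X'*X; the Hermitian wrapper
# tells downstream code to ignore the untouched other triangle
A = Hermitian(BLAS.herk('U', 'C', 1.0, X), :U)

A ≈ X' * X # true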

Try to search for "kernel matrices julia" and you may find something useful. For example, I found KernelMatrices.jl by @cgeoga

You may also find something if you search for "quadratic form julia", I don't know.


Some quick googling shows that the Julia benevolent overlords already had you covered almost 10 years ago:
https://github.com/JuliaLang/julia/issues/659

using BenchmarkTools
a = rand(100,100); b = rand(100,100);

@benchmark $a' * $a
# median time:      110.190 μs (0.00% GC)

@benchmark $b' * $a
# median time:      836.717 μs (0.00% GC)
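As a bonus, that specialized method fills in the second triangle by copying the first, so (at least on the Julia versions I have tried) the result comes out exactly Hermitian in floating point:

julia> using LinearAlgebra

julia> ishermitian(a' * a)
true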

Just a remark: it is often advantageous from a numerical viewpoint to use algorithms that do not require forming the Gramian and instead work with its square-root (Cholesky) factor.
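For a data matrix that factor is cheap to obtain directly: if X = QR is a thin QR factorization, then X'*X = R'*R, so R plays the role of the Cholesky factor and the Gramian never has to be formed. A sketch (sizes just for illustration):

using LinearAlgebra

X = randn(ComplexF64, 1000, 50)
R = UpperTriangular(qr(X).R) # X'*X == R'*R up to rounding

b = randn(ComplexF64, 50)
y = R \ (R' \ b) # solves (X'*X) y = b with two triangular solves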

@zdenek_hurak could you expand on your comment? What type of algorithms are you referring to?

@FMeirinhos thanks, sometimes it's just a matter of keywords!

I meant something along the lines of the square-root balancing used for model order reduction (e.g. "Balancing free square-root algorithm for computing singular perturbation approximations", IEEE Conference Publication, and certainly described elsewhere). Perhaps it is relevant for your problem, maybe not. Just in case: the author has been actively developing the MatrixEquations.jl package. Have a look at it; perhaps you may find something relevant there.

Typically you should look at the minimum time rather than the median, since timing noise is always positive. (I usually use @btime, which only reports the minimum.) When you do that, the conclusion is the opposite on my computer:

julia> @btime $a' * $a;
  49.545 μs (2 allocations: 78.20 KiB)

julia> @btime $b' * $a;
  30.643 μs (2 allocations: 78.20 KiB)

Similarly for a larger matrix:

julia> @btime $A' * $A;
  342.402 μs (2 allocations: 78.20 KiB)

julia> @btime $B' * $A;
  246.269 μs (2 allocations: 78.20 KiB)

It looks like the culprit is threading: it seems that dgemm is better parallelized than dsyrk in OpenBLAS, probably because the computation in dgemm is more regular. If I change to single-threaded BLAS then dsyrk is faster in both cases:

julia> LinearAlgebra.BLAS.set_num_threads(1)

julia> @btime $A' * $A;
  487.042 μs (2 allocations: 78.20 KiB)

julia> @btime $B' * $A;
  602.739 μs (2 allocations: 78.20 KiB)

julia> @btime $a' * $a;
  58.816 μs (2 allocations: 78.20 KiB)

julia> @btime $b' * $a;
  61.733 μs (2 allocations: 78.20 KiB)

Thanks, I am not sure I fully understand the whole process yet, but it may be useful later.


To step back a little bit: do you actually need to compute that matrix? If you just need a few entries, or to apply it to vectors, something like this pseudo-Julia might be helpful:

using LinearAlgebra

struct GramMatrix{T}
  X::Matrix{T} # or perhaps a QR factorization of X?
end

# one entry is just a small inner product of columns j and k of X
Base.getindex(G::GramMatrix, j::Int, k::Int) = dot(view(G.X, :, j), view(G.X, :, k))

# apply X'X to a vector with two matvecs, never forming the Gram matrix
Base.:*(G::GramMatrix{T}, x::Vector{T}) where {T} = G.X' * (G.X * x)

# solve (X'X) y = x, perhaps using IterativeSolvers.jl with the fast matvec above
# Base.:\(G::GramMatrix, x::Vector) = ...

# ....

Of course, if you need the whole matrix then this won't be enough, but particularly for very rectangular X you could stand to save a lot with just a few lines of code like this if you only need something simple.
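For instance, hypothetical usage with the sketch above:

X = randn(10_000, 20) # very rectangular data matrix
G = GramMatrix(X)
G[1, 2]       # one entry, via a single inner product
G * randn(20) # X' * (X * x), without ever forming X'X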

Sorry for the (very) late reply!

Actually, I am computing some reduced normal equations from covariance and cross-covariance matrices.
So I have some CovX = X'*X, CrcovXY = X'*Y and CovY = Y'*Y matrices, which are then assembled as
Z = CovY - CrcovXY' * (CovX \ CrcovXY).

Of course, this gets me into a lot of numerical trouble, and there seems to be significant error propagation.
For instance, although Z should theoretically be Hermitian, numerically it is not.
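For what it's worth, assembling Z through the Cholesky factor of CovX (assuming it is Hermitian positive definite) avoids the explicit inverse and makes Z Hermitian by construction; a sketch with made-up sizes:

using LinearAlgebra

X, Y = randn(ComplexF64, 1000, 20), randn(ComplexF64, 1000, 30)
CovX, CrcovXY, CovY = X' * X, X' * Y, Y' * Y

C = cholesky(Hermitian(CovX)) # CovX == C.L * C.L'
W = C.L \ CrcovXY             # triangular solve instead of an explicit inverse
Z = Hermitian(CovY - W' * W)  # Hermitian by construction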

I read the articles you mentioned a bit more thoroughly. It seems that the available algorithms for computing the Cholesky factor of the Gramian require a realization of a system (whether in continuous or discrete time). Because of that, I do not see how to take advantage of this (otherwise promising) approach.