Fast computation of the Gramian of a matrix

For some model identification problems, I have to compute the Gramian of data matrices, i.e. compute A = X'*X where X is a complex rectangular matrix.

Is there a fast implementation of this specific operation in Julia, given that the output matrix is Hermitian?

A first rough assessment suggests this could halve both the computation time and the memory requirements. It could also reduce the numerical error propagation that otherwise leads to a non-Hermitian numerical output.

Try searching for “kernel matrices julia” and you may find something useful. For example, I found KernelMatrices.jl by @cgeoga

You may also find something if you search for “quadratic form julia”, I don’t know.


Some quick googling shows that the Julia benevolent overlords already had you covered almost 10 years ago:

using BenchmarkTools
a = rand(100,100); b = rand(100,100);

@benchmark $a' * $a
# median time:      110.190 μs (0.00% GC)

@benchmark $b' * $a
# median time:      836.717 μs (0.00% GC)
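To spell out what those benchmarks show for the complex/Hermitian case: `X' * X` already dispatches to the rank-k update BLAS routine (`herk` for complex inputs), which computes only one triangle. You can also call the BLAS wrapper directly and tag the result as `Hermitian` so that downstream factorizations exploit the structure. A minimal sketch (the matrix sizes are just for illustration):

```julia
using LinearAlgebra

X = rand(ComplexF64, 200, 50)

# X' * X already hits the Hermitian rank-k update (zherk) under the hood,
# but the result comes back as a plain dense matrix.
A = X' * X

# Calling herk directly fills only the requested triangle; wrapping the
# result in Hermitian tells later factorizations to read that triangle only.
Ah = Hermitian(BLAS.herk('U', 'C', 1.0, X), :U)

A ≈ Ah  # same Gramian, structure made explicit
```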

Just a remark: it is often advantageous from a numerical viewpoint to use algorithms that do not require forming the Gramian and instead work with its square-root (Cholesky) factor.
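A concrete illustration of that idea: if X = QR is a thin QR factorization, then X'*X = R'*Q'*Q*R = R'*R, so the triangular factor R already plays the role of a Cholesky factor of the Gramian and is obtained without forming X'*X, avoiding the squaring of the condition number. A quick sketch:

```julia
using LinearAlgebra

X = rand(ComplexF64, 200, 50)

# R from the thin QR factorization satisfies R' * R == X' * X
# (Q has orthonormal columns), so it is a square-root factor of the Gramian.
R = qr(X).R

R' * R ≈ X' * X
```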

@zdenek_hurak could you expand on your comment? What type of algorithms are you referring to?

@FMeirinhos thanks, sometimes it’s just a matter of keywords!

I meant something along the lines of the square-root balancing used for model order reduction (https://doi.org/10.1016/S1474-6670(17)49168-3 or https://doi.org/10.1109/CDC.1991.261486, and certainly described elsewhere). Perhaps it is relevant for your problem, maybe not. Just in case, the author has been actively developing the MatrixEquations.jl package. Have a look at it; perhaps you may find something relevant there.

Typically you should look at the minimum time rather than the median, since timing noise is always positive. (I usually use @btime, which only reports the minimum.) When you do that, the conclusion is the opposite on my computer:

julia> @btime $a' * $a;
  49.545 μs (2 allocations: 78.20 KiB)

julia> @btime $b' * $a;
  30.643 μs (2 allocations: 78.20 KiB)

Similarly for a larger matrix:

julia> @btime $A' * $A;
  342.402 μs (2 allocations: 78.20 KiB)

julia> @btime $B' * $A;
  246.269 μs (2 allocations: 78.20 KiB)

It looks like the culprit is threading: it seems that dgemm is better parallelized than dsyrk in OpenBLAS, probably because the computation in dgemm is more regular. If I change to single-threaded BLAS, then dsyrk is faster in both cases:

julia> LinearAlgebra.BLAS.set_num_threads(1)

julia> @btime $A' * $A;
  487.042 μs (2 allocations: 78.20 KiB)

julia> @btime $B' * $A;
  602.739 μs (2 allocations: 78.20 KiB)

julia> @btime $a' * $a;
  58.816 μs (2 allocations: 78.20 KiB)

julia> @btime $b' * $a;
  61.733 μs (2 allocations: 78.20 KiB)

Thanks, I am not sure I fully understand the whole process yet, but it may be useful later.


To step back a little bit: do you actually need to compute that matrix at all? If you just need a few entries, or to apply it to vectors, something like this pseudo-Julia might be helpful:

using LinearAlgebra: dot

struct GramMatrix{T}
  X::Matrix{T} # Or perhaps a QR factorization of X?
end

# Entry (j, k) is the small inner product of the j-th and k-th column slices
Base.getindex(G::GramMatrix{T}, j::Int, k::Int) where{T} = dot(view(G.X, :, j), view(G.X, :, k))
# Manual two-part matrix-vector multiplication, never forming X'*X
Base.:*(G::GramMatrix{T}, x::Vector{T}) where{T} = G.X' * (G.X * x)
Base.:\(G::GramMatrix{T}, x::Vector{T}) where{T} = ... # perhaps using IterativeSolvers.jl with your fast matvec

# ....

Of course, if you need the whole matrix then this won’t be enough, but particularly for very rectangular X you could potentially save a lot with just a few lines of code like this, provided you only need something simple.

Sorry for the (very) late reply!

Actually, I am computing some reduced normal equations from covariance and cross-covariance matrices.
So I have CovX = X'*X, CrcovXY = X'*Y, and CovY = Y'*Y matrices, which are then assembled as
Z = CovY - CrcovXY'*(CovX\CrcovXY).

Of course, this gets me into a lot of numerical trouble, and there seems to be significant error propagation.
For instance, although Z should theoretically be Hermitian, numerically it is not.
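If the asymmetry is only at rounding-error level, a common workaround is to symmetrize explicitly by averaging with the adjoint and then tagging the result as `Hermitian`. A sketch assuming the Schur-complement form of the reduced normal equations (the sizes here are arbitrary):

```julia
using LinearAlgebra

X = rand(ComplexF64, 100, 8); Y = rand(ComplexF64, 100, 4)
CovX = X' * X; CrcovXY = X' * Y; CovY = Y' * Y

# In exact arithmetic this Schur complement is Hermitian; rounding can break that.
Z = CovY - CrcovXY' * (CovX \ CrcovXY)

# Averaging with the adjoint removes the rounding-level asymmetry (the diagonal
# becomes exactly real), and the Hermitian tag lets factorizations such as
# cholesky rely on the structure.
Zh = Hermitian((Z + Z') / 2)

ishermitian(Zh)  # true by construction
```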

I read the articles you mentioned more thoroughly. It seems that the available algorithms for computing the Cholesky factor of the Gramian require a realization of a system (whether in continuous or discrete time). Because of that, I do not yet see how to take advantage of this (otherwise promising) approach.