Most Efficient Way to Compute a Quadratic Matrix Form

Given a symmetric positive semidefinite matrix P, what would be the most efficient (fastest) way to calculate the matrix quadratic form:

x^T P x

For the following cases:

  1. The matrix P is dense.
  2. The matrix P is sparse.
  3. The matrix P is only symmetric (no definiteness assumed).

I think the answer will depend on how rank-deficient the matrix is (and its size), at least in the dense case. Is P reasonably close to full rank, or do you expect it to be severely degenerate?

If it is full rank, then I would claim that the fastest thing to do for (1) is sum(abs2, cholesky(P).L'*x). And if the matrix is of low enough rank, then I’d be tempted to compute a very high precision low-rank approximation and use that instead. But again, there’s obviously overhead with something like that, so things like the size of P and how degenerate it is will presumably be important.
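For what it’s worth, a minimal sketch of the identity behind that one-liner (this assumes P is strictly positive definite so that cholesky succeeds; note the factorization itself costs O(n^3/3), so it only pays off if the factor is reused):

using LinearAlgebra

n = 1_000;
M = randn(n, n);
P = Symmetric(M'M + I);              # strictly positive definite example
x = randn(n);

C = cholesky(P)                      # P = C.L * C.L'
sum(abs2, C.L' * x) ≈ dot(x, P, x)   # true: x'*P*x = ||L'*x||^2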

EDIT: sorry, I always compute x^T P^{-1} x, not the multiplication, so my solution for (1) was an autopilot thing and not going to be best for multiplications. And I even forgot to put the transpose in. Not my most precise comment, sorry.


I would just use dot(x, P, x) (to compute x^* P x; I’m not sure if you care whether it is conjugated in the complex case?), and worry about further micro-optimization only if benchmarks/profiling justifies the effort. This should be reasonably efficient in all cases even if it is not necessarily “optimal”.
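For reference, a quick check of the conjugation behavior in the complex case (a throwaway example; sizes arbitrary):

using LinearAlgebra

x = randn(ComplexF64, 4);
A = randn(ComplexF64, 4, 4);
P = Hermitian((A + A') / 2);

dot(x, P, x) ≈ x' * P * x   # true: the first argument is conjugated;
                            # x' * P * x also allocates the intermediate P * x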


You may assume it is something I found I need to optimize as much as possible.
As far as I know, dot() won’t take advantage of P being a symmetric matrix.
Is there an implementation which at least does that and gains some speed?


I have asked something similar recently (Efficient approach to multiply three matrices (M1*M2*M3) and two vectors and a matrix (x*M*y)).

Here’s an example where a matrix that is sufficiently large and rank-deficient benefits from something fancier:

using KernelMatrices, BenchmarkTools, LinearAlgebra

const M = randn(50, 10_000);
const A = M'M;               # 10_000 × 10_000, rank at most 50
const x = randn(10_000);

function fact_method(A, x)
  (U, V) = KernelMatrices.ACA(A, 1e-14)  # adaptive cross approximation: A ≈ U*V'
  dot(U'*x, V'*x)                        # x'*A*x ≈ (U'*x)' * (V'*x)
end

@btime dot($x, $A, $x)     # 27.1 ms on my machine w/ 4 LAPACK threads
@btime fact_method($A, $x) # 27.2 ms with same setup.

I picked a rank that is about the break-even point, but if you make it lower than 50 then fact_method wins.

This factorization is the adaptive cross approximation (ACA), which I think is reasonably comparable to an LU factorization with a fancy pivoting strategy, but which chooses the pivots based on partial information to avoid passing over the whole matrix. I’m not super knowledgeable about the lowest-level details of pivoting strategies, though, so that might be incorrect. In any case, I implemented it here a long time ago. I don’t think it’s embarrassingly slow, but that package was one of my first big Julia projects, and while I haven’t seen anything better, I bet somebody like @stevengj could sit down and write something much faster in the span of a coffee break.

I bring all this up to say that the code isn’t great and the factorization as currently implemented doesn’t specialize for symmetry, but even so, for non-ridiculous matrix sizes and ranks it does beat the naive dot(x, P, x). I would bet that a smarter implementation could be even more competitive. So I do think that the answer will really depend on some parameters of the problem, like size and rank.


In this kind of problem, you’d be much better off not forming A explicitly in the first place…

I agree. If I had its factorization (square root, Cholesky, or any other) it would have been easy.
But assume I don’t have it.

Obviously I’m not suggesting that you form the large rank-deficient square matrix and then re-factorize it. My point in that post is that the cost of the factorization might be worth it in some circumstances if the matrix is large enough and rank-deficient enough, and the matrix A is designed to be an example of that.

@RoyiAvital, can you comment on any more specifics for your problem, like the size and rank? I think without more specific information it’s hard to give particularly productive advice.


I think your direction is well understood: if the rank is low enough, one could apply a factorization to speed things up (Cholesky for P positive definite, or LDLᵀ in case it is only symmetric, would be the classic choices, while KernelMatrices.jl may offer more advanced methods; see KernelMatrices.jl - A Software Package for Working with Hierarchical Matrices).
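As an aside, for the positive semidefinite low-rank case, a pivoted Cholesky already gives a rank-revealing factor without extra packages. A sketch (assumes Julia ≥ 1.8 for RowMaximum, and the rank/tolerance handling here is simplistic):

using LinearAlgebra

n, r = 2_000, 40;
M = randn(n, r);
P = Symmetric(M * M');                        # PSD with rank r

C = cholesky(P, RowMaximum(); check = false); # stops at the numerical rank
k = C.rank;                                   # ≈ r
L = C.L[:, 1:k];                              # P[C.p, C.p] ≈ L * L'

x = randn(n);
sum(abs2, L' * x[C.p]) ≈ dot(x, P, x)         # true up to the pivoting tolerance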

What about the case where the rank is too high for those methods to be feasible?
Is there any function which can take advantage of the structure of the problem?


I appreciate the shout-out, although it’s really just the ACA that would be relevant for forming efficient low-rank representations like A = U V^T, and I have a feeling most people here could write a better ACA than what’s in there. So I would really only suggest using that functionality as a cheap proof of concept or something.

In any case, I don’t really have any ideas or knowledge about more thoughtful methods than the three-argument dot, at least at this level of generality and in a setting where factorizations aren’t helpful. But I’ll definitely be watching this thread, because I find the topic interesting too, and I’m posting here in part because I’d love to see faster methods than the ones I currently use…


It seems there is a native implementation inside the standard library which is quite fast,
at least according to the answer to An Efficient Way (Fastest) to Calculate a Quadratic Form in Julia.


For convenience I place it here (after some relabelling):

julia> using LinearAlgebra
julia> using BenchmarkTools

julia> x = rand(10000);
julia> A = rand(10000, 10000);

julia> P = (A + A') / 2;
julia> Pâ‚› = Symmetric(P);

julia> @btime dot($x, $P, $x);
  51.301 ms (0 allocations: 0 bytes)

julia> @btime dot($x, $Pâ‚›, $x);
  29.362 ms (0 allocations: 0 bytes)

So it seems things are in good shape for the dense case.
Moreover, it seems that @Oscar_Smith found a way to make things even faster.

What about the sparse case? What does it use?
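To make the question concrete, this is the kind of call I mean (toy density, just illustrative; if I read SparseArrays correctly, the three-argument dot has a specialized method that only visits stored entries):

using LinearAlgebra, SparseArrays, BenchmarkTools

x = randn(10_000);
A = sprandn(10_000, 10_000, 1e-3);
Pₛₚ = (A + copy(A')) / 2;      # sparse and symmetric

@btime dot($x, $Pₛₚ, $x);      # should scale with nnz(Pₛₚ), not n^2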


I think it also depends on whether you need to do the calculation only once, or repeat it for a lot of different vectors x with fixed P. In the latter case, paying the price to calculate a reduced-rank approximation could be worth it.
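For example, a sketch of that trade (assuming P is PSD and genuinely low rank; the truncation tolerance is arbitrary): pay for one eigendecomposition up front, after which every quadratic form with a new x is O(n*r) instead of O(n^2):

using LinearAlgebra

n, r = 2_000, 50;
M = randn(n, r);
P = Symmetric(M * M');               # PSD, rank r (illustration)
x = randn(n);

F = eigen(P);                                      # one-time cost
keep = F.values .> 1e-12 * maximum(F.values);      # numerically nonzero spectrum
L = F.vectors[:, keep] .* sqrt.(F.values[keep])';  # P ≈ L * L'

quadform(L, x) = sum(abs2, L' * x)   # O(n*r) per new x
quadform(L, x) ≈ dot(x, P, x)        # true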


I tried to replicate @Oscar_Smith’s result from the Stack Overflow comment:

using BenchmarkTools;
using LinearAlgebra;
using LoopVectorization;

function QuadForm( vX :: Vector{T}, mA :: Matrix{T}, vY :: Vector{T} ) where {T <: AbstractFloat}

    (axes(vX)..., axes(vY)...) == axes(mA) || throw(DimensionMismatch());
    m, n = size(mA);
    s = zero(T);
    @tturbo for jj in 1:n
        yj = vY[jj];
        t = zero(T);
        for ii in 1:m
            t += mA[ii, jj] * vX[ii];
        end
        s += t * yj;
    end

    return s;

end

function QuadForm( vX :: Vector{T}, mA :: Symmetric{T, Matrix{T}}, vY :: Vector{T} ) where {T <: AbstractFloat}
    # Slower than not using the Symmetric property

    (length(vX) == length(vY) == size(mA, 1)) || throw(DimensionMismatch())
    n = length(vY);
    s = zero(T);
    if mA.uplo == 'U'
        @inbounds for jj in 1:n
            @fastmath s += vX[jj] * mA[jj, jj] * vY[jj]; #<! Diagonal
            @inbounds for ii in 1:(jj - 1)
                @fastmath s += vX[ii] * mA[ii, jj] * vY[jj] + vX[jj] * mA[ii, jj] * vY[ii];
            end
        end
    else #<! if A.uplo == 'L'
        @inbounds for jj in 1:n
            @fastmath s += vX[jj] * mA[jj, jj] * vY[jj]; #<! Diagonal
            @inbounds for ii in (jj + 1):n
                @fastmath s += vX[ii] * mA[ii, jj] * vY[jj] + vX[jj] * mA[ii, jj] * vY[ii];
            end
        end
    end

    return s;

end

# Pick one problem size; each line overwrites the previous:
numRows = 10; numCols = 5; mA = randn(numRows, numCols); mB = Symmetric(randn(numRows, numRows)); mC = Matrix(mB); vX = randn(numRows); vY = randn(numCols);
numRows = 100; numCols = 50; mA = randn(numRows, numCols); mB = Symmetric(randn(numRows, numRows)); mC = Matrix(mB); vX = randn(numRows); vY = randn(numCols);
numRows = 1_000; numCols = 500; mA = randn(numRows, numCols); mB = Symmetric(randn(numRows, numRows)); mC = Matrix(mB); vX = randn(numRows); vY = randn(numCols);

@btime dot($vX, $mA, $vY)
@btime QuadForm($vX, $mA, $vY)
@btime dot($vX, $mB, $vX)
@btime QuadForm($vX, $mB, $vX) #<! Slow!
@btime dot($vX, $mC, $vX)
@btime QuadForm($vX, $mC, $vX) #<! Faster (No symmetry)

I’m on a Ryzen 7940.
On my system the Symmetric code makes no gains.
Up to 500 elements, the LoopVectorization.jl code is 30-50% faster.

Is there anything else to optimize, specifically for the symmetric case?
Since the inner loop’s range depends on the outer loop index, LoopVectorization.jl cannot be applied to it.
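The only cheap workaround I have so far (it just restates the numbers above): when the same Symmetric matrix is reused many times, materialize it to a dense Matrix once and call the rectangular kernel, trading 2x memory for the @tturbo loop:

mBFull = Matrix(mB);      #<! One-time O(n^2) copy, fills both triangles
QuadForm(vX, mBFull, vX)  #<! Dispatches to the fast @tturbo method above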