Most Efficient Way to Compute a Quadratic Matrix Form

RoyiAvital · August 18, 2021, 3:04pm

Given a Symmetric Positive Semi definite Matrix P , what would be the most efficient (Fast) way to calculate the matrix quadratic form:

{x}^{T} P x

For the following cases:

The matrix P is dense.
The matrix P is sparse.
The matrix P is only symmetric (No definition).

cgeoga · August 18, 2021, 3:34pm

I think the answer will depend on how rank-deficient the matrix is (and its size), at least in the dense case. Is P reasonably close to full rank, or do you expect it to be severely degenerate?

~~If it is full rank, then I would claim that the fastest thing to do for (1) is sum(abs2, cholesky(P).L'*x)~~. And if the matrix is low enough rank, then I’d be tempted to compute a very high precision low rank approximation and then put that in. But again, there’s obviously overhead with something like that and so things like the size of P and how degenerate it is will presumably be important.

EDIT: sorry, I always compute x^T P^{-1} x, not the multiplication, so my solution for (1) was an autopilot thing and not going to be best for multiplications. And I even forgot to put the transpose in. Not my most precise comment, sorry.

stevengj · August 18, 2021, 5:40pm

I would just use dot(x, P, x) (to compute x^* P x; I’m not sure if you care whether it is conjugated in the complex case?), and worry about further micro-optimization only if benchmarks/profiling justifies the effort. This should be reasonably efficient in all cases even if it is not necessarily “optimal”.

RoyiAvital · August 18, 2021, 5:58pm

You may assume it is something I found I need to optimize as much as possible.
As far as I know, the dot() won’t take advantage of mP being a symmetric matrix.
Is there an implementation which at least do that and gain some speed?

fipelle · August 18, 2021, 6:19pm

I have asked something similar recently (Efficient approach to multiply three matrices (M1*M2*M3) and two vectors and a matrix (x*M*y)).

cgeoga · August 18, 2021, 6:26pm

Here’s an example where a matrix that is sufficiently large and rank-deficient benefits from something fancier:

using KernelMatrices, BenchmarkTools

const M = randn(50, 10_000);
const A = M'M;
const x = randn(10_000)

function fact_method(A,x)
  (U,V) = KernelMatrices.ACA(A, 1e-14)
  dot(U'*x, V'*x)
end

@btime dot($x, $A, $x)     # 27.1 ms on my machine w/ 4 LAPACK threads
@btime fact_method($A, $x) # 27.2 ms with same setup.

I picked a rank that is about the break-even speed, but if you make it lower than 50 then the fact_method wins.

This factorization is the adaptive cross approximation, which I think is reasonably comparable to an LU factorization with a fancy pivoting strategy, but which chooses the pivots based on partial information to avoid passing over the whole matrix. I’m not super knowledgeable about the lowest level details about pivoting strategies, though, so that might be incorrect. In any case, I implemented it here a long time ago. I don’t think it’s embarrassingly slow, but that package was one of my first big Julia projects and so while I haven’t seen anything better I bet somebody like @stevengj could sit down and write something much faster in the timespan of a coffee break.

I bring all this up to say that the code isn’t great and the factorization as currently implemented doesn’t specialize for symmetry, but even still for non-ridiculous matrix conditions it does beat the naive dot(x, P, x). I would bet that a smarter implementation could be even more competitive. So I do think that the answer will really depend on some parameters of the problem like size and rank.

stevengj · August 18, 2021, 7:07pm

In this kind of problem, you’d be much better off not forming A explicitly in the first place…

RoyiAvital · August 18, 2021, 7:08pm

I agree. If I had its factorization (Root, Cholesky, etc… or any other) it would have been easy.
But assume I don’t have it.

cgeoga · August 18, 2021, 7:18pm

Obviously I’m not suggesting that you form the large rank-deficient square matrix and then re-factorize it. My point in that post is that the cost of the factorization might be worth it in some circumstances if the matrix is large enough rank deficient enough, and the matrix A is designed to be an example of that.

@RoyiAvital, can you comment on any more specifics for your problem, like the size and rank? I think without more specific information it’s hard to give particularly productive advice.

RoyiAvital · August 18, 2021, 7:38pm

I think your direction is very well understood, if the rank is low enough one could apply factorization to speed things up (Cholesky for P being Positive Definite, LDL in case it is only symmetric would be classic choice while KernelMatrices.jl may offer more advanced methods [See KernelMatrices.jl - A Software Package for Working with Hierarchical Matrices]).

What about the case the rank is high to make those method not feasible.
Is there any function which can take advantage of the structure of the problem?

cgeoga · August 18, 2021, 8:02pm

I appreciate the shout-out, although it’s really just the ACA that would be relevant for forming efficient low-rank representations like A = U V^T, and I have a feeling most people here could write a better ACA than what’s in there. So I would really only suggest anybody use that functionality as a cheap way to test for a proof of concept or something.

In any case, I don’t really have any ideas or knowledge about more thoughtful methods than the three-argument dot, at least at this level of generality and in a setting where factorizations aren’t helpful. But I’ll definitely be watching this thread, because I also find the topic interesting and post here in part because I’d love to see faster methods than the ones I currently use…

EricJohnson · January 10, 2022, 2:33pm

It seems there is a native implementation inside the standard library which is quite fast.
At least according to the answer for An Efficient Way (Fastest) to Calculate a Quadratic Form in Julia.

zdenek_hurak · January 10, 2022, 3:18pm

For convenience I place it here (after some relabelling)

julia> using LinearAlgebra
julia> using BenchmarkTools

julia> x = rand(10000);
julia> A = rand(10000, 10000);

julia> P = (A + A') / 2;
julia> Pₛ = Symmetric(P);

julia> @btime dot($x, $P, $x);
  51.301 ms (0 allocations: 0 bytes)

julia> @btime dot($x, $Pₛ, $x);
  29.362 ms (0 allocations: 0 bytes)

RoyiAvital · January 11, 2022, 7:11am

So it seems things are in good shape for the Dense case.
Moreover, it seems that @Oscar_Smith found a way to make things even faster.

What about the sparse case? What does it use?

JM_Beckers · January 11, 2022, 11:45am

I think it also depends if you need to do the calculation only once, or if you need to repeat it for a lot of different vectors x but fixed P. In the latter case paying the price to calculate a reduced rank approximation could be worth doing.

RoyiAvital · October 27, 2024, 6:24pm

I tried to replicate @Oscar_Smith 's result from the StackOverflow comment:

using BenchmarkTools;
using LinearAlgebra;
using LoopVectorization;

function QuadForm( vX :: Vector{T}, mA :: Matrix{T}, vY :: Vector{T} ) where {T <: AbstractFloat}

    (axes(vX)..., axes(vY)...) == axes(mA) || throw(DimensionMismatch());
    m, n = size(mA);
    s = zero(T);
    @tturbo for jj in 1:n
        yj = vY[jj];
        t = zero(T);
        for ii in 1:m
            t += mA[ii, jj] * vX[ii];
        end
        s += t * yj;
    end

    return s;

end

function QuadForm( vX :: Vector{T}, mA :: S, vY :: Vector{T} ) where {T <: AbstractFloat, S <: Symmetric{<: T, <: Matrix{<: T}}}
    # Slower than not using the Symmetric property

    (length(vX) == length(vY) == size(mA, 1)) || throw(DimensionMismatch())
    n = length(vY);
    s = zero(T);
    if mA.uplo == 'U'
        @inbounds for jj in 1:n
            @fastmath s += vX[jj] * mA[jj, jj] * vY[jj]; #<! Diagonal
            @inbounds for ii in 1:(jj - 1)
                @fastmath s += vX[ii] * mA[ii, jj] * vY[jj] + vX[jj] * mA[ii, jj] * vY[ii];
            end
        end
    else #<! if A.uplo == 'L'
        @inbounds for jj in 1:n
            @fastmath s += vX[jj] * mA[jj, jj] * vY[jj]; #<! Diagonal
            @inbounds for ii in (jj + 1):n
                @fastmath s += vX[ii] * mA[ii, jj] * vY[jj] + vX[jj] * mA[ii, jj] * vY[ii];
            end
        end
    end

    return s;

end

numRows = 10; numCols = 5; mA = randn(numRows, numCols); mB = Symmetric(randn(numRows, numRows)); mC = Matrix(mB); vX = randn(numRows); vY = randn(numCols);
numRows = 100; numCols = 50; mA = randn(numRows, numCols); mB = Symmetric(randn(numRows, numRows)); mC = Matrix(mB); vX = randn(numRows); vY = randn(numCols);
numRows = 1_000; numCols = 500; mA = randn(numRows, numCols); mB = Symmetric(randn(numRows, numRows)); mC = Matrix(mB); vX = randn(numRows); vY = randn(numCols);

@btime dot($vX, $mA, $vY)
@btime QuadForm($vX, $mA, $vY)
@btime dot($vX, $mB, $vX)
@btime QuadForm($vX, $mB, $vX) #<! Slow!
@btime dot($vX, $mC, $vX)
@btime QuadForm($vX, $mC, $vX) #<! Faster (No symmetry)

I’m on Ryzen 7940.
On my system the Symmetric code make no gains.
Up to 500 elements the LoopVectorization.jl code is 30-50% faster.

Is there anything else to optimize? Specifically for the symmetric case.
Since the inner loop depends on the outer loop, LP.jl is not valid.