# Faster quadratic expression for symmetric matrices

Hi, I have a use case where I need to evaluate many quadratic expressions of the form x^T M x, where the matrix M is always symmetric. The typical dimensionality of M is between 3 and 3000.

Using the following two approaches for computing this expression,

```julia
product_1(x, M) = dot(x, M, x)

function product_2(x, M)
    Mx = M * x
    return dot(x, Mx)
end
```

and when comparing `Matrix{Float64}` and `Symmetric{Float64, Matrix{Float64}}` matrices, I measured the following performance:

I’m a bit surprised that the `Symmetric` case is not always faster. I’m also not sure how “bad” the tradeoff between computation time and memory allocations is when using `product_2` for large matrices (for D = 1000: 1 allocation of ~8 KiB, compared to 0 allocations for `product_1`).

I also tried implementing this product myself, since, due to the symmetry of M, only the lower or the upper triangle of M needs to be considered. However, I have not been able to make this product as fast as the other two functions, even though I only iterate over half of the matrix.

```julia
function product_3(x, M)
    n = length(x)
    result = zero(eltype(x))

    @inbounds for i = 1:n
        # diagonal contribution
        result += x[i]^2 * M[i, i]

        # off-diagonal contributions, counted twice by symmetry
        @simd for j = 1:i-1
            result += 2 * x[i] * x[j] * M[j, i]
        end
    end

    return result
end
```

(`product_3` is much slower and has a lot of allocations.)

Do you have any suggestions on how to get the best performance for this type of product with symmetric matrices? Is there a way around using different functions and datatypes for different sizes of M to get the best performance?

2 Likes

It might be worth trying MKL.

The LinearAlgebra library already implements exactly such a method.
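For reference, the method in question is the three-argument `dot`, which dispatches on the matrix type. A minimal sketch of how it is called (variable names are my own, not from the thread):

```julia
using LinearAlgebra

n = 500
A = randn(n, n)
M = A + A'          # dense symmetric matrix
S = Symmetric(M)    # symmetric wrapper around the same data
x = randn(n)

# Three-argument dot computes x' * M * x without allocating M * x;
# the Symmetric wrapper selects the specialized method.
q_dense = dot(x, M, x)
q_sym   = dot(x, S, x)

q_dense ≈ q_sym     # the two agree up to rounding
```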

There was some discussion of the performance issues when this was implemented, where it was noted that this was not as fast (for large matrices) as the naive approach using the BLAS matrix–vector multiplication. See julia#32739 (add generalized dot product).

To do better here, you basically have to re-implement the BLAS `dsymv` (symmetric matrix–vector product) function in `dot(x, A, b)`.
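As a point of comparison, one can at least call the BLAS `symv` kernel directly with a preallocated buffer; a sketch (the helper name and setup are mine, not from the thread):

```julia
using LinearAlgebra

# Hypothetical helper: BLAS dsymv into a preallocated buffer, then dot.
function quadform_symv!(buf::Vector{Float64},
                        S::Symmetric{Float64,Matrix{Float64}},
                        x::Vector{Float64})
    # buf = 1.0 * S * x + 0.0 * buf, reading only the triangle S.uplo refers to
    BLAS.symv!(S.uplo, 1.0, parent(S), x, 0.0, buf)
    return dot(x, buf)
end

n = 300
A = randn(n, n)
S = Symmetric(A + A')
x = randn(n)
buf = similar(x)    # reused across calls, so no per-call allocations

quadform_symv!(buf, S, x) ≈ dot(x, S, x)
```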

Optimizing this kind of thing is non-trivial, but to give you an idea, here is an optimized (but not fully debugged) skew-symmetric matrix–vector product in Julia developed by Simon Mataigne with help from @Elrod: SkewLinearAlgebra.jl/skewsymmetric.jl at 54067de46fbe5dc3f1dd48546a64dc7b406b48b5 · JuliaLinearAlgebra/SkewLinearAlgebra.jl · GitHub

2 Likes

Ah, sorry, I just noticed I hadn’t properly replaced one variable name and was accessing a global variable. `product_3` is actually fine regarding allocations.

Here is the speed comparison including `product_3`:

1 Like

I figured a symmetric quadratic form would be a fun topic for a blog post.

For a 200x200 test problem (an admittedly convenient size), I got the runtime for a symmetric matrix down to <1 microsecond on my desktop.
For comparison, `product_1` took 3.66 and 7 microseconds on dense and symmetric matrices, respectively, while `product_3` took 5.4 and 8.9.

`symv!` and a skew-symmetric version are a little more complicated because lining up computations is more important when you need to store them in a particular destination.

Perhaps Simon could follow this up with a write-up on skew-symmetric matrix–vector multiplication (or perhaps I’ll choose that as my excuse to procrastinate someday); I’d be happy to accept a PR at spmd_blog.

10 Likes

Wow, thanks a lot for this very nice and detailed discussion of the problem.
Even though your code is way above my skill level, the blog entry was a very interesting read for me.
Looking forward to using `@turbo` at some point and letting “LoopModels” do all the magic.

@Elrod, as always this is really great. Two questions:

1. To make the comparison with no allocations on both sides, is there a way to use a preallocated combination of the `dsymv` and matvec products in MKL?
2. What about the case of large matrices with multithreaded code?

By the way, when using the same matrix repeatedly, what I do is as follows:

1. Decompose the matrix.
2. Apply the decomposition as a matrix–vector product with `mul!()` into a preallocated buffer.
3. Use `dot()` on the result (maybe `sum(abs2, ...)` is better; either way, there should be a more efficient approach).
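For a positive-definite M, the steps above can be sketched as follows (my own names; this assumes M admits a Cholesky factorization M = U'U, so that x' M x = ||U x||²):

```julia
using LinearAlgebra

n = 200
A = randn(n, n)
M = A' * A + n * I          # make M symmetric positive definite
C = cholesky(Symmetric(M))  # step 1: decompose once
buf = zeros(n)              # preallocated buffer, reused across calls

function quadform_chol!(buf, C, x)
    mul!(buf, C.U, x)       # step 2: buf = U * x (triangular matvec)
    return sum(abs2, buf)   # step 3: ||U x||^2 == x' * M * x
end

x = randn(n)
quadform_chol!(buf, C, x) ≈ dot(x, Symmetric(M), x)
```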

Great blog post. A few comments and questions:

Typo?

> CPI is cycles per instruction, the higher the value, the more clock cycles were required per instruction. We see that the execution stall rate was higher for product_4.

I think you mean product_3 here. You don’t define product_4 until later.

Question

- What is a “rectangular loop”?

1 Like

A rectangular loop is when you have nested loops whose bounds don’t interact with each other (i.e., the shape the loop variables form is a hyper-rectangle).
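For illustration, a sketch of the two loop shapes (function names are mine):

```julia
# Rectangular loop: the inner bounds are independent of the outer index,
# so the iteration space is a rectangle.
function rect_sum(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

# Triangular loop, as in product_3: the inner bound depends on i,
# so the iteration space is a triangle, not a rectangle.
function tri_sum(A)
    s = 0.0
    for i in axes(A, 1)
        for j in 1:i-1
            s += A[j, i]
        end
    end
    return s
end
```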

2 Likes

Good catch, thanks.

I’ve not used rss before, but I defined the global variables mentioned here in this PR.

Oscar is correct about a rectangular loop.
The new LoopModels is going to support arbitrary polyhedra, not just rectangles.
That still doesn’t cover everything; e.g., many operations with sparse or ragged arrays aren’t polyhedral.

Same with loops calling functions that have iterative algorithms, like `gcd`, inside of them. I plan on adding some support for those eventually, too…
(Note that `gcd` currently does work with LoopVectorization, because it treats it as any old function, but if you manually inlined it, `while` loop and all, it won’t work.)

Sure:

```julia
julia> @btime product_8($x, $B)
  1.021 μs (0 allocations: 0 bytes)
560649.0855276876

julia> @btime product_2!($y, $x, $B)
  1.716 μs (0 allocations: 0 bytes)
560649.0855276876

julia> @btime product_2!($y, $x, $A)
  2.373 μs (0 allocations: 0 bytes)
560649.0855276876
```

Didn’t make much difference to minimum time.

That should just be memory bound, but multithreading could certainly help by increasing the effective size of your local caches (and increasing the total memory bandwidth you have access to).
Writing multithreaded versions would be a bit more cumbersome.
At some point, I’ll improve Polyester’s local variable support, to give it an option to work more like `LoopVectorization`’s (which supports non-allocating reductions).
Then we could `Polyester.@batch` the outer loop.

I guess the point of this is that you can `dot(y,y)` or `sum(abs2, y)` instead of dotting two different vectors?
`dot` is mostly limited by the rate at which you can load from the vectors…
But I’m not sure how much this will help, especially if you can fuse the operations like I did.

But `x' * inv(S) * x` is also a really common operation, and that’s obviously a really good idea there.

Also, if you have the same matrix repeatedly, any chance you can multiply an entire matrix, instead of one vector at a time?
`symm!` or `trmm!` are going to be way faster than a bunch of `symv!` or `trmv!`s.
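As a sketch of that idea (my own helper, not from the thread): with k query vectors stacked as columns of X, all k quadratic forms are the diagonal of X' M X, so one mat-mat call replaces k mat-vec calls.

```julia
using LinearAlgebra

# Hypothetical helper: batch k quadratic forms into one symmetric mat-mat
# product instead of k separate mat-vec products.
function quadforms_batched(S::Symmetric, X::Matrix)
    MX = S * X                          # one call, dispatches to BLAS symm
    return vec(sum(X .* MX, dims = 1))  # columnwise dots = diag(X' * M * X)
end

n, k = 100, 8
A = randn(n, n)
S = Symmetric(A + A')
X = randn(n, k)

qs = quadforms_batched(S, X)
all(j -> qs[j] ≈ dot(X[:, j], S, X[:, j]), 1:k)
```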

@Elrod, could you share the code for the non-allocating versions?

My assumption is that in a multithreaded scenario there will be gains in real-world cases, hence it is a better representation. Maybe a different winner will then emerge?

Indeed, when one can, packing multiple mat-vec operations into a mat-mat operation will reduce overhead. But sometimes the different vectors become available at different times, so it is not always viable.