Faster quadratic expression for symmetric matrices

Hi, I have a use case where I need to evaluate many quadratic expressions of the form x^T M x, with the matrix M always being symmetric. The typical dimensionality of M is between 3 and 3000.

Using the following two approaches for computing this expression,

product_1(x, M) = dot(x, M, x)

function product_2(x, M)
    Mx = M * x       # allocates a temporary vector of length(x)
    return dot(x, Mx)
end

and comparing Matrix{Float64} against Symmetric{Float64, Matrix{Float64}} matrices, I found the following performance results:

I’m a bit surprised that the Symmetric case is not always faster. Also, I’m not sure how “bad” the tradeoff between computation time and memory allocations is when using product_2 for large matrices (for D = 1000: 1 allocation of ~8 KiB, compared to 0 allocations for product_1).
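For reference, a non-allocating variant with a caller-supplied buffer might look like the sketch below (presumably similar to the product_2! benchmarked later in this thread; the exact signature is my assumption):

using LinearAlgebra

# In-place variant: writes M * x into a preallocated buffer Mx instead of allocating.
function product_2!(Mx, x, M)
    mul!(Mx, M, x)   # Mx = M * x, no allocation
    return dot(x, Mx)
end

# usage: Mx = similar(x); product_2!(Mx, x, M)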

I also tried to implement this product myself, since, due to the symmetry of M, only the lower or the upper triangle of M needs to be considered. However, I have not been able to make my version as fast as the other two functions, even though I only iterate over half of the matrix.

function product_3(x, M)
    n = length(x)
    result = zero(eltype(x))

    @inbounds for i = 1:n
        # diagonal contribution
        result += x[i]^2 * M[i, i]

        # off-diagonal contributions: each (i, j) pair appears twice in x' M x
        @simd for j = 1:i-1
            result += 2 * x[i] * x[j] * M[j, i]
        end
    end

    return result
end

(product_3 is much slower and has a lot of allocations.)

Do you have any suggestions on how to get the best performance for this type of product with symmetric matrices? Is there a way around using different functions and datatypes for different sizes of M to get the best performance?

2 Likes

It might be worth trying MKL.

The LinearAlgebra standard library already implements exactly such a method: the three-argument dot(x, A, y) has a specialization for Symmetric matrices.
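For example, a minimal usage sketch:

using LinearAlgebra

M = Symmetric(rand(1000, 1000))
x = rand(1000)

dot(x, M, x)   # three-argument dot, dispatching to the Symmetric-specialized method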

There was some discussion of the performance issues when this was implemented, where it was noted that this was not as fast (for large matrices) as the naive approach using the BLAS matrix–vector multiplication. See julia#32739 (add generalized dot product).

To do better here, you basically have to re-implement the BLAS dsymv (symmetric matrix–vector product) function in dot(x, A, b).

Optimizing this kind of thing is non-trivial, but to give you an idea, here is an optimized (but not fully debugged) skew-symmetric matrix–vector product in Julia developed by Simon Mataigne with help from @Elrod: SkewLinearAlgebra.jl/skewsymmetric.jl at 54067de46fbe5dc3f1dd48546a64dc7b406b48b5 · JuliaLinearAlgebra/SkewLinearAlgebra.jl · GitHub

2 Likes

Ah, sorry, I just noticed I didn’t properly replace one variable name and was accessing a global variable. product_3 is actually fine regarding allocations.
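For anyone hitting the same thing, a hypothetical illustration of that gotcha (the names are made up):

M_global = rand(100, 100)   # non-const global

# Referencing the non-const global makes the loop type-unstable, so it allocates.
bad(x) = sum(x[i]^2 * M_global[i, i] for i in eachindex(x))

# Passing the matrix as an argument (or declaring the global const) fixes it.
good(x, M) = sum(x[i]^2 * M[i, i] for i in eachindex(x))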

Here is the speed comparison including product_3:

1 Like

I figured a symmetric quadratic form would be a fun topic for a blog post.

For a 200x200 test problem (an admittedly convenient size), I got the runtime for a symmetric matrix down to <1 microsecond on my desktop.
For comparison, product_1 took 3.66 and 7 microseconds on dense and symmetric matrices, respectively, while product_3 took 5.4 and 8.9 microseconds.

symv! and a skew-symmetric version are a little more complicated because lining up computations is more important when you need to store them in a particular destination.

Perhaps Simon could follow this up with a write-up on skew-symmetric matrix–vector multiplication (or perhaps I’ll choose that as my excuse to procrastinate someday); I’d be happy to accept a PR at spmd_blog.

10 Likes

Wow, thanks a lot for this very nice and detailed discussion of the problem.
Even though your code is way above my skill level, the blog entry was a very interesting read for me.
Looking forward to using @turbo at some point and letting “LoopModels” do all the magic :wink:

@Elrod, as always, this is really great. Two questions:

  1. To make the comparison allocation-free on both sides, is there a way to use a preallocated combination of dsymv and the mat-vec product in MKL?
  2. What about the case of large matrices with multithreaded code?

By the way, when using the same matrix repeatedly, what I do is the following (see the sketch after this list):

  1. Decompose the matrix.
  2. Apply the decomposition as a matrix-vector product with mul!() and a preallocated buffer.
  3. Use dot() on the result (maybe sum(abs2, ...) is better; in any case, there should be a more efficient way).
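A minimal sketch of that pattern, assuming M is symmetric positive definite so that a Cholesky factorization exists:

using LinearAlgebra

C = cholesky(Symmetric(M))   # one-time factorization: M = U' * U
buf = similar(x)             # preallocated buffer, reused across calls

mul!(buf, C.U, x)            # buf = U * x, written into the buffer
q = sum(abs2, buf)           # x' * M * x == ||U * x||^2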

Great blog post. A few comments and questions:

Typo?

CPI is cycles per instruction, the higher the value, the more clock cycles were required per instruction. We see that the execution stall rate was higher for product_4.

I think you mean product_3 here. You don’t define product_4 until later.

Questions

  • What is a “rectangular loop”?

  • Could you set up an RSS feed for your blog?

1 Like

A rectangular loop is when you have nested loops whose bounds don’t depend on each other (i.e., the shape the loop variables form is a hyper-rectangle).
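A hypothetical illustration, assuming some matrix A:

# Rectangular: both loop bounds are fixed, so the iteration space is a rectangle.
function rect_sum(A)
    s = zero(eltype(A))
    for i in axes(A, 2), j in axes(A, 1)
        s += A[j, i]
    end
    return s
end

# Not rectangular: the inner bound depends on i (as in product_3 above), so the
# iteration space is a triangle rather than a hyper-rectangle.
function tri_sum(A)
    s = zero(eltype(A))
    for i in axes(A, 2), j in 1:i-1
        s += A[j, i]
    end
    return s
end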

2 Likes

Good catch, thanks.

I’ve not used RSS before, but I defined the global variables mentioned here in this PR.

Oscar is correct about a rectangular loop.
The new LoopModels is going to support arbitrary polyhedra, not just rectangles.
That still doesn’t cover everything; e.g., many operations on sparse or ragged arrays aren’t polyhedral.

Same with loops calling functions that contain iterative algorithms, like gcd, inside them. I plan on adding some support for those eventually, too…
(Note that gcd currently does work with LoopVectorization, because it treats it as any old function, but if you manually inlined it, while loop and all, it would not work.)

Sure:

julia> @btime product_8($x, $B)
  1.021 μs (0 allocations: 0 bytes)
560649.0855276876

julia> @btime product_2!($y, $x, $B)
  1.716 μs (0 allocations: 0 bytes)
560649.0855276876

julia> @btime product_2!($y, $x, $A)
  2.373 μs (0 allocations: 0 bytes)
560649.0855276876

Didn’t make much difference to minimum time.

That should just be memory bound, but multithreading could certainly help by increasing the effective size of your local caches (and increasing the total memory bandwidth you have access to).
Writing multithreaded versions would be a bit more cumbersome.
At some point, I’ll improve Polyester’s local variable support, to give it an option to work more like LoopVectorization’s (which supports non-allocating reductions).
Then we could Polyester.@batch the outer loop.
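In the meantime, a hedged sketch of one way to thread the triangular loop with plain Base tasks (not the Polyester approach just described):

# Chunk the columns across tasks; each task accumulates a partial sum over its
# slice of the upper triangle, then the partials are reduced. Note that with a
# triangular iteration space, equal-size chunks give unbalanced work.
function product_threaded(x, M)
    n = length(x)
    chunks = Iterators.partition(1:n, cld(n, Threads.nthreads()))
    tasks = map(chunks) do cols
        Threads.@spawn begin
            s = zero(eltype(x))
            @inbounds for i in cols
                s += x[i]^2 * M[i, i]
                @simd for j in 1:i-1
                    s += 2 * x[i] * x[j] * M[j, i]
                end
            end
            s
        end
    end
    return sum(fetch, tasks)
end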

I guess the point of this is that you can dot(y,y) or sum(abs2, y) instead of dotting two different vectors?
dot is mostly limited by the rate at which you can load from the vectors…
But I’m not sure how much this will help, especially if you can fuse the operations like I did.

But x' * inv(S) * x is also a really common operation, and that’s obviously a really good idea there.
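For that case, a hedged sketch (assuming S is positive definite, so it has a Cholesky factorization):

using LinearAlgebra

C = cholesky(Symmetric(S))   # one-time: S = L * L'
q = sum(abs2, C.L \ x)       # x' * inv(S) * x == ||L \ x||^2, no explicit inverse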

Also, if you have the same matrix repeatedly, any chance you can multiply an entire matrix, instead of one vector at a time?
symm! or trmm! are going to be way faster than a bunch of symv! or trmv! calls.
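For illustration, a sketch of that batching idea (the sizes are made up):

using LinearAlgebra

n, k = 1000, 32
M = randn(n, n); M = (M + M') / 2   # a dense symmetric matrix
X = randn(n, k)                     # k input vectors, one per column
Y = similar(X)

# One symmetric mat-mat product instead of k symv! calls; only M's upper triangle is read.
BLAS.symm!('L', 'U', 1.0, M, X, 0.0, Y)

quads = [dot(view(X, :, j), view(Y, :, j)) for j in 1:k]   # x_j' * M * x_j per column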

@Elrod, could you share the code for the non-allocating versions?

My assumption is that in a multithreaded scenario there will be gains in real-world cases, hence that is a better representation. Maybe a different winner will emerge then?

Indeed, when one can, packing multiple mat-vec operations into a single mat-mat operation will reduce overhead. But sometimes the different vectors become available at different times, so it is not viable.