# Speeding up matrix multiplication involving dot and Hadamard products

Hi everyone,

I am looking for suggestions to speed up the following operation, which I need to frequently evaluate to construct a Jacobian matrix, and which accounts for approx. 80% of computation time for said Jacobian.

The expression I am evaluating is `(S .* (S * (d .* S))) * X` where

• S is an n × n symmetric matrix
• d is an n × 1 column vector
• X is an n × p matrix
• n ≫ p, e.g. n = 100, p = 10

I tried a naive loop implementation and one using LoopVectorization and Tullio, but both are substantially slower than evaluating the expression directly. Below is a code example:

```julia
using BenchmarkTools, Random, Tullio, LoopVectorization;

Random.seed!(1);
n = 100;
p = 10;

S = randn(n,n);
S = S'*S;
d = randn(n);
X = randn(n,p);

function naive_loop(S::Array{T,2}, d::Vector{T}, X::Array{T,2}) where T
    out = similar(X)
    @inbounds for i in 1:size(S,1)
        for l in 1:size(X,2)
            out[i,l] = naive_inner_loop(S,d,X,i,l)
        end
    end
    out
end

function naive_inner_loop(S::Array{T,2}, d::Vector{T}, X::Array{T,2}, i::Int, l::Int) where T
    r = zero(T)
    @inbounds for j in 1:size(S,1)
        for k in 1:size(S,1)
            r += S[i,j]*S[i,k]*S[j,k]*d[k]*X[j,l]
        end
    end
    r
end

function tullio_loop(S::Array{T,2}, d::Vector{T}, X::Array{T,2}) where T
    @tullio out[i,l] := @inbounds S[j,i]*S[k,i]*S[j,k]*d[k]*X[j,l]
end

@assert naive_loop(S,d,X) ≈ tullio_loop(S,d,X);
@assert tullio_loop(S,d,X) ≈ (S .* (S * (d .* S))) * X;

@btime naive_loop($S,$d,$X);
# 10.543 ms (1 allocation: 7.94 KiB)
@btime tullio_loop($S,$d,$X);
#  1.306 ms (50 allocations: 10.56 KiB)
@btime ($S .* ($S * ($d .* $S))) * $X;
# 81.802 μs (7 allocations: 242.45 KiB)
```

Any help on how to speed up this operation is much appreciated.

Best regards,

Philipp

Given that the expression mostly calls optimized BLAS linear-algebra routines, which are already multithreaded, it is hard to beat. You can save some allocations by using in-place operations, for example:

```julia
julia> buff2 = zeros(100,10); buff = zeros(100,100);

julia> @btime mul!($buff2, ($S .* mul!($buff, $S, ($d .* $S))), $X);
  71.450 μs (4 allocations: 156.34 KiB)
```

but that does not speed anything up significantly. I’m not sure if you can remove all allocations that way.

What does give a speedup is switching to MKL:

```julia
julia> using MKL

julia> @btime ($S .* ($S * ($d .* $S))) * $X;
  30.955 μs (7 allocations: 242.45 KiB)
```

Indeed, I was not sure if there was some clever way to exploit the structure of the problem that could speed things up a little.

That is a great spot, thanks for this! I guess my best shot is to try and reduce allocations in conjunction with MKL.

Note BTW that this is O(N^4), from nested loops over `i,j,k,l`, while ordinary matmul is O(N^3).
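
Concretely, for the sizes in this thread (my own back-of-the-envelope count of innermost multiply-adds, not a measurement from the benchmarks above):

```julia
n, p = 100, 10

ops_naive    = n * p * n * n   # four nested loops over i, l, j, k
ops_two_step = n^3 + n^2 * p   # build buff over (j, i, k), then one matmul with X

ops_naive, ops_two_step        # (10000000, 1100000): roughly 9x fewer operations
```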

There’s a chance that doing `@tullio buff[j,i] = S[j,i]*S[k,i]*S[j,k]*d[k]` followed by matmul with `X` (two N^3 operations in series) will be competitive. The relative speeds will depend on your computer and which BLAS library you are using:

```julia
julia> using Tullio, LoopVectorization

julia> function two_cubes(S, d, X, buff, buff2)
           @tullio buff[j,i] = S[j,i]*S[k,i]*S[j,k]*d[k]
           mul!(buff2, buff, X)
       end;

julia> @btime two_cubes($S, $d, $X, $buff, $buff2);
  min 59.292 μs, mean 86.212 μs (50 allocations, 2.47 KiB)
```

Re avoiding allocations, one more buffer gets you to zero:

```julia
julia> @btime ($S .* ($S * ($d .* $S))) * $X;
  min 105.750 μs, mean 333.542 μs (7 allocations, 242.45 KiB)  # openblas
  min 25.917 μs, mean 48.494 μs (7 allocations, 242.45 KiB)    # apple's

julia> buff2 = zeros(100,10); buff = zeros(100,100);

julia> @btime mul!($buff2, ($S .* mul!($buff, $S, ($d .* $S))), $X);
  min 107.000 μs, mean 332.390 μs (4 allocations, 156.34 KiB)
  min 24.459 μs, mean 45.467 μs (4 allocations, 156.34 KiB)

julia> buff3 = zeros(100,100);

julia> @btime mul!($buff2, ($buff3 .= $S .* mul!($buff, $S, ($buff3 .= $d .* $S))), $X);
  min 98.917 μs, mean 268.524 μs (0 allocations)  # openblas
  min 22.875 μs, mean 23.701 μs (0 allocations)   # apple
```

Edit: Note also that you can wrap `S` in `Symmetric` to get some routines specialised for such matrices. However, only one of the multiplications here can take advantage of it, and it isn’t always faster:

```julia
julia> issymmetric(S)
true

julia> @btime mul!($buff2, ($buff3 .= $S .* mul!($buff, Symmetric($S), ($buff3 .= $d .* $S))), $X);
  min 99.125 μs, mean 246.848 μs (0 allocations)  # openblas
  min 72.834 μs, mean 92.357 μs (0 allocations)   # apple
```

And for completeness, the zero-allocation `@tturbo` version from below:

```julia
julia> out = similar(X); temp1 = similar(S); temp2 = similar(S);

julia> @btime full_comput!($out, $temp1, $temp2, $S, $d, $X);  # from @romainvieme below
  min 25.584 μs, mean 28.031 μs (0 allocations)
```

Another path is to try `@tturbo` on the complete loop. You may be able to improve on that by adjusting the order in which the indices are run (although I think the macro already does that). In any case, the single allocation observed here is of course the `out` array, and you could easily make this non-allocating. The performance is not as good as that of the other alternatives, though.

```julia
julia> using LoopVectorization

julia> function naive_loop(S::Array{T,2}, d::Vector{T}, X::Array{T,2}) where T
           out = similar(X)
           @tturbo for i in 1:size(S,1)
               for l in 1:size(X,2)
                   r = zero(T)
                   for j in 1:size(S,1)
                       for k in 1:size(S,1)
                           r += S[i,j]*S[i,k]*S[j,k]*d[k]*X[j,l]
                       end
                   end
                   out[i,l] = r  # `=`, not `+=`: `similar` leaves `out` uninitialized
               end
           end
           out
       end
naive_loop (generic function with 1 method)

julia> @btime naive_loop($S,$d,$X);
  243.603 μs (1 allocation: 7.94 KiB)
```
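
For reference, the allocation could be removed by hoisting `out` into an argument; a minimal sketch along the same lines (`naive_loop!` is my naming, not from the post above):

```julia
using LoopVectorization

function naive_loop!(out::Matrix{T}, S::Matrix{T}, d::Vector{T}, X::Matrix{T}) where T
    @tturbo for i in 1:size(S, 1)
        for l in 1:size(X, 2)
            r = zero(T)
            for j in 1:size(S, 1)
                for k in 1:size(S, 1)
                    r += S[i,j] * S[i,k] * S[j,k] * d[k] * X[j,l]
                end
            end
            out[i,l] = r  # plain assignment, so `out` need not be zeroed beforehand
        end
    end
    out
end
```

With a preallocated `out = similar(X)`, `@btime naive_loop!($out, $S, $d, $X)` should then report zero allocations.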

ps: `Array{T,2}` can be written as `Matrix{T}` (and usually it is better to use `AbstractVector` and `AbstractMatrix`, so that the function accepts views, etc.).


Hello, on my machine using `@tturbo` and buffers slightly outperforms MKL:

```julia
using LoopVectorization, LinearAlgebra, Random, BenchmarkTools, MKL

function hadamard!(out, S::Array{T,2}, d::Array{T,1}) where T
    @tturbo for i = 1:size(S, 1)
        for j = 1:size(S, 2)
            out[i, j] = d[i] * S[i, j]
        end
    end
end

function hadamard!(out, U::Array{T,2}, V::Array{T,2}) where T
    @tturbo for i = 1:size(U, 1)
        for j = 1:size(U, 2)
            out[i, j] = U[i, j] * V[i, j]
        end
    end
end

# temp1 and temp2 are buffers similar to S
function full_comput!(out, temp1, temp2, S, d, X)
    hadamard!(temp1, S, d)      # temp1 = d .* S
    mul!(temp2, S, temp1)       # temp2 = S * (d .* S)
    hadamard!(temp1, S, temp2)  # temp1 = S .* (S * (d .* S))
    mul!(out, temp1, X)
end

Random.seed!(1);
n = 100;
p = 10;

S = randn(n,n);
S = S'*S;
d = randn(n);
X = randn(n,p);

out = similar(X); temp1 = similar(S); temp2 = similar(S);
```

and with `julia -t 16`

```julia
julia> @btime full_comput!($out, $temp1, $temp2, $S, $d, $X);
  32.738 μs (0 allocations: 0 bytes)

julia> @btime ($S .* ($S * ($d .* $S))) * $X;
  36.222 μs (7 allocations: 242.45 KiB)
```

Maybe LoopVectorization.jl can reorder things, but it is normally faster to iterate down columns, not along rows.
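
For reference, Julia arrays are column-major: `A[i,j]` and `A[i+1,j]` sit next to each other in memory, so the first index should vary fastest. A minimal illustration (function names are mine):

```julia
function sum_colmajor(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)  # i (first index) is the inner loop: contiguous
        s += A[i, j]
    end
    s
end

function sum_rowmajor(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)  # j (second index) is the inner loop: strided
        s += A[i, j]
    end
    s
end
```

Both compute the same sum, but on large matrices the column-major order is typically noticeably faster; `@turbo`/`@tturbo` will often fix such loop orders for you.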


Thanks for the help!

Indeed, since I have no understanding of the workings of `@tullio`, I was hoping for some magic that did not materialize.

I tried this, with results close to `(S .* (S * (d .* S))) * X`, but then I figured I might as well go with that.

What I find interesting is that even making S symmetric first does not improve performance here.

```julia
julia> SS = Symmetric(S);

julia> @btime mul!($buff2, ($buff3 .= $S .* mul!($buff, $SS, ($buff3 .= $d .* $S))), $X);
  77.914 μs (0 allocations: 0 bytes)
```

zero-alloc seems to be the way to go, thanks for the comments!

You’re right! It’s even faster if I switch the two for loops.

```julia
julia> @btime full_comput!($out, $temp1, $temp2, $S, $d, $X);
  29.375 μs (0 allocations: 0 bytes)
```

Thanks!

It may be… but I think `@turbo` reorders the loops, so that is probably noise. It won’t harm, though.

That looks pretty good too! I’m using a function along the lines of `hadamard!` in my optimization, so I might try this approach.

I guess I have to thoroughly benchmark all approaches and check how the suggestions integrate into the whole optimization routine.

Thanks for the help all!

Yes, it should.

I am planning on implementing the O(N^4) → O(N^3) optimization in the rewrite because code like this keeps coming up.
FWIW, there’s plenty of literature on this, e.g. [2010.03074] On Simplifying Dependent Polyhedral Reductions
