Speeding up matrix multiplication involving dot and Hadamard products

Hi everyone,

I am looking for suggestions to speed up the following operation, which I need to evaluate frequently to construct a Jacobian matrix, and which accounts for approx. 80% of the computation time for said Jacobian.

The expression I am evaluating is (S .* (S * (d .* S))) * X where

  • S is an n x n symmetric matrix
  • d is an n x 1 column vector
  • X is an n x p matrix
  • n >> p, e.g. n = 100, p = 10
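
For reference, written out element-wise (using the symmetry of S, i.e. S[k,j] = S[j,k]), the (i,l) entry of the result is

out[i,l] = Σ_j Σ_k S[i,j] * S[i,k] * S[j,k] * d[k] * X[j,l]

which is exactly what the loop implementations below compute.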

I tried a naive loop implementation and one using LoopVectorization and Tullio, but both are substantially slower than simply evaluating the expression as written. Below is a code example:

using BenchmarkTools, Random, Tullio, LoopVectorization; 

Random.seed!(1); 
n = 100; 
p = 10; 

S = randn(n,n); 
S = S'*S; 
d = randn(n); 
X = randn(n,p); 

function naive_loop(S::Array{T,2},d::Vector{T},X::Array{T,2}) where T 
    out = similar(X) 
    @inbounds for i in 1:size(S,1) 
        for l in 1:size(X,2)
            out[i,l] = naive_inner_loop(S,d,X,i,l) 
        end 
    end 
    out 
end 

function naive_inner_loop(S::Array{T,2},d::Vector{T},X::Array{T,2},i::Int,l::Int) where T 
    r = zero(T) 
    @inbounds for j in 1:size(S,1) 
        for k in 1:size(S,1)
            r += S[i,j]*S[i,k]*S[j,k]*d[k]*X[j,l] 
        end 
    end 
    r 
end 

function tullio_loop(S::Array{T,2},d::Vector{T},X::Array{T,2}) where T 
    @tullio out[i,l] := @inbounds S[j,i]*S[k,i]*S[j,k]*d[k]*X[j,l] 
end 

@assert naive_loop(S,d,X) ≈ tullio_loop(S,d,X);  
@assert tullio_loop(S,d,X) ≈ (S .* (S * (d .* S))) * X ; 

@btime naive_loop($S,$d,$X) ; 
# 10.543 ms (1 allocation: 7.94 KiB)
@btime tullio_loop($S,$d,$X) ; 
#  1.306 ms (50 allocations: 10.56 KiB)
@btime ($S .* ($S * ($d .* $S))) * $X;  
# 81.802 μs (7 allocations: 242.45 KiB)

Any help on how to speed up this operation is much appreciated.

Best regards,

Philipp

Given that this is calling mostly optimized, already multithreaded BLAS linear algebra routines, it is hard to beat. You can save some allocations by using in-place operations, for example:

julia> using LinearAlgebra  # for mul!

julia> buff2 = zeros(100,10); buff = zeros(100,100);

julia> @btime mul!($buff2,($S .* mul!($buff,$S,($d .* $S))),$X);
  71.450 μs (4 allocations: 156.34 KiB)

but that does not speed anything up significantly. I’m not sure if you can remove all allocations that way.

What does speed things up is switching to MKL:

julia> using MKL

julia> @btime ($S .* ($S * ($d .* $S))) * $X;
  30.955 μs (7 allocations: 242.45 KiB)

Thanks for the quick reply!

Indeed, I was not sure if there was some clever way to exploit the structure of the problem that could speed things up a little.

That is a great spot, thanks for this! I guess my best shot is to try and reduce allocations in conjunction with MKL.

Note BTW that this is O(N^4), from nested loops over i,j,k,l, while ordinary matmul is O(N^3).

There’s a chance that doing @tullio buff[j,i] = S[j,i]*S[k,i]*S[j,k]*d[k] followed by matmul with X (two N^3 operations in series, since the sum over k is done once per (i,j) pair) will be competitive. The relative speeds will depend on your computer and on which BLAS library you are using:

julia> using Tullio, LoopVectorization, LinearAlgebra  # LinearAlgebra for mul!

julia> function two_cubes(S, d, X, buff, buff2)
       @tullio buff[j,i] = S[j,i]*S[k,i]*S[j,k]*d[k]
       mul!(buff2, buff, X)
       end;

julia> @btime two_cubes($S, $d, $X, $buff, $buff2);
  min 59.292 μs, mean 86.212 μs (50 allocations, 2.47 KiB)

Re avoiding allocations, one more buffer gets you to zero:

julia> @btime ($S .* ($S * ($d .* $S))) * $X;  
  min 105.750 μs, mean 333.542 μs (7 allocations, 242.45 KiB)  # openblas
  min 25.917 μs, mean 48.494 μs (7 allocations, 242.45 KiB)  # apple's

julia> buff2 = zeros(100,10); buff = zeros(100,100);

julia> @btime mul!($buff2,($S .* mul!($buff,$S,($d .* $S))),$X);
  min 107.000 μs, mean 332.390 μs (4 allocations, 156.34 KiB)  # openblas
  min 24.459 μs, mean 45.467 μs (4 allocations, 156.34 KiB)  # apple's

julia> buff3 = zeros(100,100);

julia> @btime mul!($buff2,($buff3 .= $S .* mul!($buff,$S,($buff3 .= $d .* $S))),$X);
  min 98.917 μs, mean 268.524 μs (0 allocations)  # openblas
  min 22.875 μs, mean 23.701 μs (0 allocations)  # apple

Edit: Note also that you can wrap S in Symmetric to get some routines specialised for such matrices. However, only one of the multiplications here can take advantage of it, and it isn’t always faster:

julia> issymmetric(S)
true

julia> @btime mul!($buff2,($buff3 .= $S .* mul!($buff, Symmetric($S), ($buff3 .= $d .* $S))),$X);
  min 99.125 μs, mean 246.848 μs (0 allocations)  # openblas
  min 72.834 μs, mean 92.357 μs (0 allocations)  # apple

And for completeness, the zero-allocation @tturbo version from below:

julia> out=similar(X); temp1=similar(S); temp2=similar(S);

julia> @btime full_comput!($out, $temp1, $temp2, $S, $d, $X);  # from @romainvieme below
  min 25.584 μs, mean 28.031 μs (0 allocations)

Another path is to try @tturbo on the complete loop. You may be able to improve on that by adjusting the order in which the indexes are run (although I think the macro already does that). In any case, at least here the single allocation observed is of course the out array, and you could easily make this non-allocating. The performance is not as good as that of the other alternatives, though.

julia> using LoopVectorization

julia> function naive_loop(S::Array{T,2},d::Vector{T},X::Array{T,2}) where T 
           out = similar(X) 
           @tturbo for i in 1:size(S,1) 
               for l in 1:size(X,2)
                  r = zero(T) 
                  for j in 1:size(S,1) 
                      for k in 1:size(S,1) 
                         r += S[i,j]*S[i,k]*S[j,k]*d[k]*X[j,l] 
                      end 
                  end 
                  out[i,l] = r  # `=` rather than `+=`: similar(X) is not zero-initialized
               end 
           end 
           out 
       end
naive_loop (generic function with 1 method)

julia> @btime naive_loop($S,$d,$X);
  243.603 μs (1 allocation: 7.94 KiB)


ps: Array{T,2} can be written as Matrix{T} (and usually it is better to use AbstractVector and AbstractMatrix, so that the function also accepts views, etc.).
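
For example, a minimal sketch of the inner function with relaxed annotations (the body is unchanged), which then also accepts views without copying:

function naive_inner_loop(S::AbstractMatrix{T}, d::AbstractVector{T}, X::AbstractMatrix{T}, i::Int, l::Int) where T
    r = zero(T)
    @inbounds for j in 1:size(S,1)
        for k in 1:size(S,1)
            r += S[i,j]*S[i,k]*S[j,k]*d[k]*X[j,l]
        end
    end
    r
end

naive_inner_loop(view(S, :, :), view(d, :), view(X, :, :), 1, 1)  # SubArrays now work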


Hello, on my machine using @tturbo and preallocated buffers slightly outperforms MKL:

using LoopVectorization, LinearAlgebra, Random, BenchmarkTools, MKL


# out .= d .* S, i.e. scale row i of S by d[i]
function hadamard!(out, S::Array{T, 2}, d::Array{T, 1}) where T
	@tturbo for i=1:size(S, 1)
		for j=1:size(S, 2)
			out[i, j]=d[i]*S[i, j]
		end
	end
end
# out .= U .* V, the elementwise (Hadamard) product
function hadamard!(out, U::Array{T, 2}, V::Array{T,2}) where T
	@tturbo for i=1:size(U, 1)
		for j=1:size(U,2)
			out[i, j]=U[i, j]*V[i, j]
		end
	end
end

# temp1 and temp2 are buffers similar to S;
# computes (S .* (S * (d .* S))) * X in four in-place stages
function full_comput!(out, temp1, temp2, S, d, X)
	hadamard!(temp1, S, d)      # temp1 .= d .* S
	mul!(temp2, S, temp1)       # temp2  = S * temp1
	hadamard!(temp1, S, temp2)  # temp1 .= S .* temp2
	mul!(out, temp1, X)         # out    = temp1 * X
end


Random.seed!(1); 
n = 100; 
p = 10; 

S = randn(n,n); 
S = S'*S; 
d = randn(n); 
X = randn(n,p); 

out=similar(X); temp1=similar(S); temp2=similar(S);

and with julia -t 16

julia> @btime full_comput!($out, $temp1, $temp2, $S, $d, $X);
  32.738 μs (0 allocations: 0 bytes)

julia> @btime ($S .* ($S * ($d .* $S))) * $X;
  36.222 μs (7 allocations: 242.45 KiB)

Maybe LoopVectorization.jl can reorder things, but it is normally faster to iterate down columns, not along rows.
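
For example, a minimal sketch of the elementwise kernel above with the loops swapped so that the row index i is innermost (Julia arrays are column-major, so consecutive values of i are adjacent in memory); the name hadamard_colmajor! is just for illustration:

using LoopVectorization

function hadamard_colmajor!(out, U::AbstractMatrix{T}, V::AbstractMatrix{T}) where T
	@tturbo for j=1:size(U, 2)      # column index outer
		for i=1:size(U, 1)          # row index inner: contiguous access
			out[i, j] = U[i, j] * V[i, j]
		end
	end
	out
end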


Thanks for the help!

Indeed, since I have no understanding of the workings of @tullio, I was hoping for some magic that did not materialize.

I tried this and got results close to ($S .* ($S * ($d .* $S))) * $X, so I figured I might as well go with that.

What I find interesting is that even wrapping S in Symmetric first does not improve performance here.

julia> SS = Symmetric(S);

julia> @btime mul!($buff2,($buff3 .= $S .* mul!($buff, $SS, ($buff3 .= $d .* $S))),$X);
  77.914 μs (0 allocations: 0 bytes)

zero-alloc seems to be the way to go, thanks for the comments!

You’re right! It’s even faster if I switch the two for loops.

julia> @btime full_comput!($out, $temp1, $temp2, $S, $d, $X);
  29.375 μs (0 allocations: 0 bytes)

Thanks!

It may be… but I think @turbo reorders the loops, so that is probably noise. It won’t harm, though.

That looks pretty good too! I’m using a function along the lines of hadamard! in my optimization, so I might try this approach.

I guess I have to thoroughly benchmark all approaches and check how the suggestions integrate into the whole optimization routine.

Thanks for the help all!

Yes, it should.

I am planning on implementing the O(N^4) → O(N^3) optimization in the rewrite because code like this keeps coming up.
FWIW, there’s plenty of literature on this, e.g. [2010.03074] On Simplifying Dependent Polyhedral Reductions
