Speeding up matrix multiplication involving dot and Hadamard product

Another path is to try @tturbo on the complete loop. You may be able to improve on that by adjusting the order in which the indices run (although I think the macro already does that). In any case, at least here the single allocation observed is of course the out array, so you could easily make this non-allocating. The performance is not as good as that of the other alternatives, though.

julia> using LoopVectorization

julia> function naive_loop(S::Array{T,2},d::Vector{T},X::Array{T,2}) where T 
           out = similar(X) 
           @tturbo for i in 1:size(S,1) 
               for l in 1:size(X,2)
                  r = zero(T) 
                  for j in 1:size(S,1) 
                      for k in 1:size(S,1) 
                         r += S[i,j]*S[i,k]*S[j,k]*d[k]*X[j,l] 
                      end 
                  end 
                  out[i,l] = r   # `=`, not `+=`: similar() leaves out uninitialized
               end 
           end 
           out 
       end
naive_loop (generic function with 1 method)

julia> @btime naive_loop($S,$d,$X);
  243.603 μs (1 allocation: 7.94 KiB)
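Since the out array is the only allocation, a mutating variant is easy to sketch. The name naive_loop! below is hypothetical, following the usual bang convention for functions that write into a caller-supplied output:

```julia
using LoopVectorization

# Sketch of a non-allocating variant: same kernel as naive_loop,
# but writes into a preallocated `out` supplied by the caller.
function naive_loop!(out::AbstractMatrix{T}, S::AbstractMatrix{T},
                     d::AbstractVector{T}, X::AbstractMatrix{T}) where T
    @tturbo for i in 1:size(S,1)
        for l in 1:size(X,2)
            r = zero(T)
            for j in 1:size(S,1)
                for k in 1:size(S,1)
                    r += S[i,j]*S[i,k]*S[j,k]*d[k]*X[j,l]
                end
            end
            out[i,l] = r   # overwrite, so `out` need not be zeroed beforehand
        end
    end
    out
end
```

The caller allocates once up front (e.g. out = similar(X)) and reuses that buffer across repeated calls, which is the usual pattern for hot loops.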


ps: Array{T,2} can be written as Matrix{T} (and usually it is better to use AbstractVector and AbstractMatrix, so that the function accepts views, etc.).
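To illustrate that last point with a toy function (colscale is a made-up name, not from the thread): an AbstractMatrix signature accepts views and other array-like types that a concrete Matrix signature would reject.

```julia
# Hypothetical example: scale each column j of M by v[j].
# The abstract signature accepts Matrix, views, transposes, etc.
colscale(M::AbstractMatrix{T}, v::AbstractVector{T}) where T = M .* v'

A = rand(4, 4)
colscale(A, ones(4))                 # a plain Matrix works
colscale(view(A, 1:2, :), ones(4))   # a view works too; ::Matrix would throw a MethodError
```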
