Another path is to try @tturbo
on the complete loop. You may be able to improve on that by adjusting the order in which the indexes are run (although I think the macro already does that). In any case, at least here the single allocation observed is of course the out
vector and then you could do this non-allocating easily. The performance is not as good as that of the other alternatives though.
julia> using LoopVectorization
julia> function naive_loop(S::Array{T,2},d::Vector{T},X::Array{T,2}) where T
out = similar(X)
@tturbo for i in 1:size(S,1)
for l in 1:size(X,2)
r = zero(T)
for j in 1:size(S,1)
for k in 1:size(S,1)
r += S[i,j]*S[i,k]*S[j,k]*d[k]*X[j,l]
end
end
out[i,l] += r
end
end
out
end
naive_loop (generic function with 1 method)
julia> @btime naive_loop($S,$d,$X);
243.603 μs (1 allocation: 7.94 KiB)
ps: Array{T,2}
can be written as Matrix{T}
(and usually it is better to use AbstractVector
and AbstractMatrix
such that the function accepts views, etc.