Loop fusion will actually be slower here. This is clearer if we simplify the code a bit. Suppose we are doing X .= colvec .* cis.(rowvec) (i.e. combining a column vector and a row vector to make a matrix). This is lowered to broadcast!((x,y) -> x * cis(y), X, colvec, rowvec), which is essentially equivalent to:
for j = 1:length(rowvec), i = 1:length(colvec)
    X[i,j] = colvec[i] * cis(rowvec[j])
end
The problem here is that if X is m×n, then we end up calling the cis function mn times (once per element of X).
If, instead, we use
tmp = cis.(rowvec)
X .= colvec .* tmp
it only calls the cis function n times, at the expense of requiring more storage.
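To make the call counts concrete, here is a minimal sketch; counted_cis and cis_calls are helper names introduced here purely for illustration (a cis wrapper that counts how often it is called):

const cis_calls = Ref(0)
counted_cis(x) = (cis_calls[] += 1; cis(x))  # same as cis, but counts calls

colvec = rand(3)                     # m = 3
rowvec = rand(4)'                    # n = 4 (row vector)
X = zeros(Complex{Float64}, 3, 4)

cis_calls[] = 0
X .= colvec .* counted_cis.(rowvec)  # fused broadcast
@show cis_calls[]                    # 12, i.e. m*n calls

cis_calls[] = 0
tmp = counted_cis.(rowvec)           # materialize the row first
X .= colvec .* tmp
@show cis_calls[]                    # 4, i.e. n calls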
This is the sort of space/time tradeoff that is hard to automate. As a rule of thumb, however, if you are doing a broadcast operation combining a row and a column vector, it will be faster to do any expensive operations on the row and column vectors separately before doing the broadcast combination.
In this particular case, I would suggest doing something like:
function test_perf5()
    rangeᵀ = (1:2000000)'
    rowtemp = similar(rangeᵀ, Complex{Float64})         # reusable row buffer
    steering_vectors = complex.(ones(4,11), ones(4,11)) # placeholder for actual vectors?
    sum_signal = zeros(Complex{Float64}, 4, length(rangeᵀ))
    for i = 1:11
        # do the expensive cis on the row once per iteration, then combine with the column
        rowtemp .= cis.(1.6069246423111792 .* rangeᵀ .+ 0.6981317007977318)
        sum_signal .+= steering_vectors[:,i] .* rowtemp
    end
    return sum_signal
end
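To see whether this actually helps on your machine, you could just time the second call (so compilation isn't included), e.g.:

test_perf5()        # run once to compile
@time test_perf5()  # then time it (or use @btime from BenchmarkTools for a steadier measurement)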
You could also maybe try @views, or simply allocate steering_vectors as an array of vectors, to avoid allocating a copy in steering_vectors[:,i]. You can also use @. before the for and then omit all of the dots.
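For concreteness, here is a sketch of the @views variant; test_perf5_views is just a name made up here, and steering_vectors is still placeholder data:

function test_perf5_views()
    rangeᵀ = (1:2000000)'
    rowtemp = similar(rangeᵀ, Complex{Float64})
    steering_vectors = complex.(ones(4,11), ones(4,11)) # placeholder again
    sum_signal = zeros(Complex{Float64}, 4, length(rangeᵀ))
    for i = 1:11
        rowtemp .= cis.(1.6069246423111792 .* rangeᵀ .+ 0.6981317007977318)
        @views sum_signal .+= steering_vectors[:,i] .* rowtemp  # view instead of a copied column
    end
    return sum_signal
end

The array-of-vectors alternative would instead build something like [complex.(ones(4), ones(4)) for _ = 1:11] once and index it as steering_vectors[i], which also avoids the per-iteration copy.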