How can one improve the performance of the following MWE (using Julia 1.0.0)?
function bg_sim_mwe(nB=100, nSamples=100)
    e = 0.9
    p = 0.8
    nt = 100
    t = range(0, stop=5.0, length=nt)
    V(ph) = 1.0 - p + p*(cos(ph) + 1im*e*sin(ph))
    phi = zeros(nB) # one phase per element of om
    Vtotal = zeros(Complex, nt)
    for a = 1:nSamples
        om = randn(nB) # this replaces an actual calculation
        for it = 1:nt
            phi .= om.*t[it]
            Vtotal[it] += prod(V, phi)
        end
    end
end
This runs as follows:
julia> using BenchmarkTools
julia> @btime bg_sim_mwe()
  18.681 ms (30301 allocations: 1.09 MiB)
In my actual code, nB=10_000 and nSamples=1_000_000.
Here is my current state of insight:

- `@inbounds` in front of the `for` statement makes no difference.
- `@fastmath` in front of the `for` statement saves 1/3 of the allocations, but does not affect performance beyond a few percent.
- `julia --track-allocation=user` indicates that 960k of allocations happen in the line with the `+=`. I assume these must occur inside of `prod`. Since the outer and the inner loop each run 100 times, this means that this line appears to require 96k of allocations. That seems a lot for a 100-element complex vector that's created on the fly and reduced via `prod`. Maybe I don't understand what is going on here.
- `@code_warntype bg_sim_mwe(100,100)` shows some red for the line with the `+=`: `::Complex` and `::Complex{_1} where _1`. I don't understand what to do about this. `Vtotal` is preallocated as a complex-valued array.
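A minimal sketch of what the red `::Complex` annotation may be pointing at, assuming the element type of `Vtotal` is the cause (this is a guess, not something the output above confirms). `Complex` without a type parameter is not a concrete type, whereas `ComplexF64` is. The names `a_abstract` and `a_concrete` are made up for illustration.

```julia
# Complex without a type parameter is not a concrete type, so an array
# declared with it cannot store its elements inline as plain numbers.
a_abstract = zeros(Complex, 3)     # same construction as Vtotal in the MWE
a_concrete = zeros(ComplexF64, 3)  # ComplexF64 is an alias for Complex{Float64}

println(eltype(a_abstract))                  # Complex
println(isconcretetype(eltype(a_abstract)))  # false
println(isconcretetype(eltype(a_concrete)))  # true
```

If this is indeed the issue, preallocating with `zeros(ComplexF64, nt)` would be the thing to try first. Whether it removes the allocations reported for the `+=` line would need to be checked with `@btime`.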
Any help is appreciated!
(Edit: Did wrong calculation of allocations.)