Hi, I am trying to integrate a simple dynamical system on a network, which is represented by an adjacency matrix. Note that the network in this case is dense and hence I am using the full matrix for it.

Initially I wrote the following code for the integration step of the system

```
@fastmath function sigmoid(x::Float64)
return 1. / (1. + exp(-x))
end
function update_step!(xt::AbstractVector, w::AbstractMatrix, dt::Float64, auxinput::AbstractVector)
mul!(auxinput, w, xt) #auxinput = W*xt
@inbounds @. xt += dt * (-xt + sigmoid(auxinput))
end
```

Then, I have to take measurements of the dynamics. In my case, this is the average of the variances of the trajectories. So I need to calculate the temporal variance for each individual and then average over. This is what I am using for it

```
function simulate_network(N::Int64, thermal_steps::Int64, simulation_steps::Int64, x::AbstractVector, w::AbstractMatrix, dt::Float64, auxinput::AbstractVector, auxmean::AbstractVector, auxvar::AbstractVector)
for t=1:thermal_steps
update_step!(x, w, dt, auxinput)
end
for t=1:simulation_steps
update_step!(x, w, dt, auxinput)
@inbounds @simd for i=1:N
auxmean[i] += x[i]
auxvar[i] += x[i]^2
end
end
return mean( auxvar ./ simulation_steps .- (auxmean ./ simulation_steps) .^2 )
end
```

Notice that I take the definition of intermediate auxiliary variables, which are defined in the global scope and then passed to these functions.

When I `@benchmark`

this function without thermalization, for a network size 300x300 and 1000 iterations I get an average time of 8.5ms in my computer.

I thought, however, that the code above could be improved by computing the `auxmean`

and `auxvar`

in the same loop used to update the system, so writing

```
function update_step_noinput2!(N::Int64, xt::AbstractVector, W::AbstractMatrix, dt::Float64,
auxinput::AbstractVector, meanx::AbstractVector, meanx2::AbstractVector)
mul!(auxinput, W, xt) #auxinput = W*xt
@inbounds @simd for i=1:N
xt[i] += dt * (-xt[i] + sigmoid(auxinput[i]))
meanx[i] += xt[i]
meanx2[i] += xt[i]^2
end
end
function simulate_network2(N::Int64, thermal_steps::Int64, simulation_steps::Int64, x::AbstractVector,
w::AbstractMatrix, dt::Float64, auxinput::AbstractVector, auxmean::AbstractVector, auxvar::AbstractVector)
for t=1:thermal_steps
update_step!(x, w, dt, auxinput)
end
for t=1:simulation_steps
update_step_noinput2!(N, x, w, dt, auxinput, auxmean, auxvar)
end
return mean( auxvar ./ simulation_steps .- (auxmean ./ simulation_steps) .^2 )
end
```

However, the second code is slightly slower (average time 9ms). In fact, I thought that it would pay off to introduce the `mul!`

function also inside the loop, so avoid doing `N`

iterations twice, but then the code becomes extremely slow, with a huge memory allocationâ€¦

```
function update_step_noinput3!(N::Int64, xt::AbstractVector, W::AbstractMatrix, dt::Float64,
auxinput::AbstractVector, meanx::AbstractVector, meanx2::AbstractVector)
@inbounds @simd for i=1:N
inpt = 0.0
@simd for j=1:N
inpt += w[j,i]*xt[j]
end
xt[i] += dt * (-xt[i] + sigmoid(inpt))
meanx[i] += xt[i]
meanx2[i] += xt[i]^2
end
end
```

The biggest problem here seems to be that I am missing something important about how Julia handles things But after many tests I am not able grasp what the problem is. Any help? Why is the second version slower than the first one? And where are the memory allocations coming from in the third example?