# Optimizing dynamical system in a network

Hi, I am trying to integrate a simple dynamical system on a network, which is represented by an adjacency matrix. Note that the network in this case is dense and hence I am using the full matrix for it.

Initially I wrote the following code for the integration step of the system

``````@fastmath function sigmoid(x::Float64)
return 1. / (1. + exp(-x))
end

function update_step!(xt::AbstractVector, w::AbstractMatrix, dt::Float64, auxinput::AbstractVector)
mul!(auxinput, w, xt) #auxinput = W*xt
@inbounds @. xt += dt * (-xt + sigmoid(auxinput))
end
``````

Then, I have to take measurements of the dynamics. In my case, this is the average of the variances of the trajectories. So I need to calculate the temporal variance for each individual and then average over. This is what I am using for it

``````function simulate_network(N::Int64, thermal_steps::Int64, simulation_steps::Int64, x::AbstractVector,  w::AbstractMatrix, dt::Float64, auxinput::AbstractVector, auxmean::AbstractVector, auxvar::AbstractVector)

for t=1:thermal_steps
update_step!(x, w, dt, auxinput)
end

for t=1:simulation_steps
update_step!(x, w, dt, auxinput)

@inbounds @simd for i=1:N
auxmean[i] += x[i]
auxvar[i]  += x[i]^2
end
end

return mean( auxvar ./ simulation_steps .- (auxmean ./ simulation_steps) .^2 )
end
``````

Notice that I take the definition of intermediate auxiliary variables, which are defined in the global scope and then passed to these functions.

When I `@benchmark` this function without thermalization, for a network size 300x300 and 1000 iterations I get an average time of 8.5ms in my computer.
I thought, however, that the code above could be improved by computing the `auxmean` and `auxvar` in the same loop used to update the system, so writing

``````function update_step_noinput2!(N::Int64, xt::AbstractVector, W::AbstractMatrix, dt::Float64,
auxinput::AbstractVector, meanx::AbstractVector, meanx2::AbstractVector)

mul!(auxinput, W, xt) #auxinput = W*xt
@inbounds @simd for i=1:N
xt[i] += dt * (-xt[i] + sigmoid(auxinput[i]))
meanx[i] += xt[i]
meanx2[i] += xt[i]^2
end
end

function simulate_network2(N::Int64, thermal_steps::Int64, simulation_steps::Int64, x::AbstractVector,
w::AbstractMatrix, dt::Float64, auxinput::AbstractVector, auxmean::AbstractVector, auxvar::AbstractVector)

for t=1:thermal_steps
update_step!(x, w, dt, auxinput)
end

for t=1:simulation_steps
update_step_noinput2!(N, x, w, dt, auxinput, auxmean, auxvar)
end

return mean( auxvar ./ simulation_steps .- (auxmean ./ simulation_steps) .^2 )
end
``````

However, the second code is slightly slower (average time 9ms). In fact, I thought that it would pay off to introduce the `mul!` function also inside the loop, so avoid doing `N` iterations twice, but then the code becomes extremely slow, with a huge memory allocationâ€¦

``````function update_step_noinput3!(N::Int64, xt::AbstractVector, W::AbstractMatrix, dt::Float64,
auxinput::AbstractVector, meanx::AbstractVector, meanx2::AbstractVector)

@inbounds @simd for i=1:N
inpt = 0.0
@simd for j=1:N
inpt += w[j,i]*xt[j]
end
xt[i] += dt * (-xt[i] + sigmoid(inpt))
meanx[i] += xt[i]
meanx2[i] += xt[i]^2
end
end
``````

The biggest problem here seems to be that I am missing something important about how Julia handles things But after many tests I am not able grasp what the problem is. Any help? Why is the second version slower than the first one? And where are the memory allocations coming from in the third example?

It would help if you provided exact scripts that we can copy and paste to run and see what youâ€™re seeing.

Sure, here you have the functions above in a handy Julia script

network_code.jl (2.6 KB)

(EDIT: I had to reupload because I realised I copied slightly old code; now matches that of the question).

I executed them with this code

``````N = 300
w = randn(N,N)
w2 = deepcopy(w')
x = rand(N)
auxinput = zeros(N)
meanx = zeros(N)
meanx2 = zeros(N)

#Execute these one by one to see the result in REPL
@benchmark simulate_network(N, 0, 1000, x, w, 0.2, auxinput, meanx, meanx2)
@benchmark simulate_network2(N, 0, 1000, x, w, 0.2, auxinput, meanx, meanx2)
@benchmark simulate_network3(N, 0, 1000, x, w2, 0.2, auxinput, meanx, meanx2)
``````

Unless your matrix is triangular, this is not equivalent to your original code: You update `xt[1]` in `i=1`, and then incorporate this for `i=2, j=1` when computing `inpt += w[1,2]*xt[1]`.

The problem with the third option is that you have a typo here:

``````function update_step_noinput3!(N::Int64, xt, W::AbstractMatrix, dt::Float64,
``````

where `W` is upper case, thus the lower-case `w` used inside the function is a global variable.

ps: I found that because with some experience one can see that the function should be non-allocating, thus something like a typo should be there.

But this is the kind of thing that the VScode linter warns you:

(W and auxinput are not used within the function)

Both of you are totally correct. First of all I should swap the old variables with the new ones, or the results will be incorrect. But this will make the code slightly slower even, so still does not explain why the second option is slower even when I do N iterations less.

The extremely slow time of the third option is from the typo. However, once this is fixed it is still 17ms in my machine, which is sensibly slower than the other options.

Such sloppy errorsâ€¦Also I had to edit a lot my post because I did also a terrible job at re-organizing my code for the question. Apologies, I have been maybe too much time looking at this and I guess Iâ€™m very tired.

If you let me some time more I will update the codes with correct code. But in any case, quite sure that I am missing something fundamental.

Edit:
New version of the code with the swapping added. Timing did not change much.
network_code.jl (2.8 KB)

``````N = 300;
w = randn(N,N);
w2 = deepcopy(w');
x = zeros(N);
oldx = rand(N);
auxinput = zeros(N);
meanx = zeros(N);
meanx2 = zeros(N);

@benchmark simulate_network(N, 0, 1000, x, w, 0.2, auxinput, meanx,meanx2)
@benchmark simulate_network2(N, 0, 1000, oldx, x, w, 0.2, auxinput, meanx, meanx2)
@benchmark simulate_network3(N, 0, 1000, oldx, x, w2, 0.2, auxinput, meanx, meanx2)

``````

ps: you donÂ´t need to keep passing `N` around. You can use `length(x)` to get the dimension of the array, or even better use `eachindex(x)`, `axes`, which will guarantee that you are iterating over the correct indexes. For instance, this would be more natural in Julia for your third functions:

``````function update_step_noinput3!(xt, w, dt, auxinput, meanx, meanx2)
for i in axes(w, 2)
inpt = zero(eltype(w))
for j in axes(w, 1)
inpt += w[j,i]*xt[j]
end
xt[i] += dt * (-xt[i] + sigmoid(inpt))
meanx[i] += xt[i]
meanx2[i] += xt[i]^2
end
end
``````

Thanks for the info, Iâ€™m aware of it but Iâ€™m not quite used to the syntax yet (coming from C++) and I strongly prefer to have my indices explicitly. Guess I might just need more time in the language to get comfortable with them.

Still the main question, which is why options 2 and 3 are slower (especially 3) despite of having way less iterations, itâ€™s still open.

For that I suggest again providing an exactly reproducible code of what youâ€™re seeing, with the updated codes.

edit: I see you did update the codes.

So my guess: The implementation of the matrix-vector product in `mul!` is really fast. And when you put it inside the loop that also computes the means, you do not allow proper vectorization of the operations.

For instance, this recovers the performance:

``````using LoopVectorization
function update_step3!(N::Int64, old_xt::AbstractVector, xt::AbstractVector, W::AbstractMatrix, dt::Float64,
meanx::AbstractVector, meanx2::AbstractVector, auxinput)
auxinput .= 0.0
@turbo for i in 1:N
for j in 1:N
auxinput[i] += W[j,i]*old_xt[j]
end
end
for i=1:N
xt[i] += dt * (-xt[i] + sigmoid(auxinput[i]))
meanx[i] += xt[i]
meanx2[i] += xt[i]^2
end
old_xt, xt = xt, old_xt
end
``````

The script with the code I am using right now, corrections to your previous comments included, should be here. Let me know if thereâ€™s any problem with it.

1 Like

See my comment above.

I am unfamiliar with `LoopVectorization.jl` but I guess itâ€™s impossible to use the `@turbo` decorator in front of the whole loop, isnâ€™t it? I donâ€™t understand why the means break the vectorization of operations.

Also, even assuming `mul!` has some kind of nice magic inside of it, what happens between versions 1 and 2? In the first one, I have the `@.` operator and then compute the means separately. Thus I should be doing a loop with N elements twice. The first time could get a speedup due to vectorization, but I find hard to believe that this is good enough to beat a single loop with N iterations where everything is done at once.

Then thereâ€™s no way to get better than version 1?

Thank you so much!

â€śVectorizationâ€ť in this context means using the ability of the processors to perform multiple similar operations simultaneously (see Advanced Vector Extensions - Wikipedia). This is what you are asking for when you put the `@simd` flag in front of the loop.

All relevant linear algebra routines use that extensively, and thus `mul!` is very fast. LoopVectorization does the same, and makes the loop faster as well. But those require some regularity in the set of instructions performed, thus if the computations get complicated you may not be able to use that processor ability anymore.

Then, it is sometimes better to split the loop in vectorizable parts, as is the case here.

For `N = 300` I donâ€™t know of anything else you can do. Note that the call to `mul!` is multi-threaded, and you could emulate that with `@tturbo` (with two `t`s), but for that size it does not appear to make a difference.

On the other side, if you have larger `N`, for instance `N = 3000`, parallelization or using GPUs may help. Your version 1 is simple enough that using a GPU involves basically wrapping the arrays:

Running this:

``````sigmoid(x) = inv(one(x) + exp(-x))
function update_step!(x, w, dt, auxinput)
mul!(auxinput, w, x) #auxinput = W*xt
@. x += dt * (-x + sigmoid(auxinput))
end
function simulate_network(simulation_steps, x, w, dt, auxinput, auxmean, auxvar)
for t=1:simulation_steps
#update_step_noinput2!(x, w, dt, auxinput)
update_step!(x, w, dt, auxinput)
auxmean .+= x
auxvar .+= x .^2
end
return mean( auxvar ./ simulation_steps .- (auxmean ./ simulation_steps) .^2 )
end
``````

I get, for N = 3000:

``````julia> using CUDA

julia> @btime simulate_network(1000, \$x, \$w, 0.2, \$auxinput, \$meanx, \$meanx2)
4.302 s (4 allocations: 23.50 KiB)
-23.77517260930618

# now using the GPU: use CuArrays
julia> @btime simulate_network(1000, \$(CuArray(x)), \$(CuArray(w)), 0.2, \$(CuArray(auxinput)), \$(CuArray(meanx)), \$(CuArray(meanx2)))
1.666 s (96133 allocations: 6.93 MiB)
-131.2218251491212
``````

(be careful with the initialization of the arrays, because even auxiliary vectors are being mutated and affecting the results)

Makes sense. Thank you very much for your patience.

It seems I was underestimating the amount of speedups that one can get from vectorization. Then writing separated, vectorizable loops is a way better option than putting everything into a single, non-vectorizable one.

If you have any recommended material on how to write vectorizable code in Julia, please share. Thanks!!

I donÂ´t. Mostly I donÂ´t think that is special to Julia. The same patterns of code will in general allow the compiler to vectorize loops in different languages. The `LoopVectorization` package helps to get very efficient vectorizations when it can convert loop to a proper syntax, but many times just keeping the loop â€śsimpleâ€ť (whatever that means) and adding `@inbounds` and `@simd` is enough.