Optimizing a dynamical system on a network

Hi, I am trying to integrate a simple dynamical system on a network, represented by an adjacency matrix. The network in this case is dense, hence I am using the full matrix for it.
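Concretely, the update below implements a forward Euler step of size dt for ẋ = −x + σ(Wx), where σ is the logistic sigmoid.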

Initially I wrote the following code for the integration step of the system

using LinearAlgebra # for mul!

@fastmath function sigmoid(x::Float64)
    return 1. / (1. + exp(-x))
end

function update_step!(xt::AbstractVector, w::AbstractMatrix, dt::Float64, auxinput::AbstractVector)
    mul!(auxinput, w, xt) # in-place auxinput = w*xt, no allocation
    @inbounds @. xt += dt * (-xt + sigmoid(auxinput)) # explicit Euler step
end

Then, I have to take measurements of the dynamics. In my case, this is the average of the variances of the trajectories, so I need to calculate the temporal variance of each individual and then average over individuals.
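In symbols, the observable is (1/N) ∑ᵢ ( ⟨xᵢ²⟩ − ⟨xᵢ⟩² ), where ⟨·⟩ denotes the average over simulation steps. This is what I am using for it: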

using Statistics # for mean

function simulate_network(N::Int64, thermal_steps::Int64, simulation_steps::Int64, x::AbstractVector,
    w::AbstractMatrix, dt::Float64, auxinput::AbstractVector, auxmean::AbstractVector, auxvar::AbstractVector)

    for t=1:thermal_steps
        update_step!(x, w, dt, auxinput)
    end

    for t=1:simulation_steps
        update_step!(x, w, dt, auxinput)

        @inbounds @simd for i=1:N
            auxmean[i] += x[i]
            auxvar[i]  += x[i]^2
        end
    end

    return mean( auxvar ./ simulation_steps .- (auxmean ./ simulation_steps) .^2 )  
end

Notice that I use intermediate auxiliary buffers, which are defined in the global scope and then passed to these functions, so that they can be reused without allocating.

When I @benchmark this function without thermalization, for a 300x300 network and 1000 iterations, I get an average time of 8.5 ms on my computer.
I thought, however, that the code above could be improved by accumulating auxmean and auxvar in the same loop used to update the system, so I wrote

function update_step_noinput2!(N::Int64, xt::AbstractVector, W::AbstractMatrix, dt::Float64,
    auxinput::AbstractVector, meanx::AbstractVector, meanx2::AbstractVector)

    mul!(auxinput, W, xt) #auxinput = W*xt 
    @inbounds @simd for i=1:N
        xt[i] += dt * (-xt[i] + sigmoid(auxinput[i]))
        meanx[i] += xt[i]
        meanx2[i] += xt[i]^2
    end
end

function simulate_network2(N::Int64, thermal_steps::Int64, simulation_steps::Int64, x::AbstractVector, 
    w::AbstractMatrix, dt::Float64, auxinput::AbstractVector, auxmean::AbstractVector, auxvar::AbstractVector)

    for t=1:thermal_steps
        update_step!(x, w, dt, auxinput)
    end

    for t=1:simulation_steps
        update_step_noinput2!(N, x, w, dt, auxinput, auxmean, auxvar)
    end

    return mean( auxvar ./ simulation_steps .- (auxmean ./ simulation_steps) .^2 )  
end

However, the second version is slightly slower (average time 9 ms). In fact, I thought it would pay off to also move the matrix-vector product inside the loop, to avoid iterating over N elements twice, but then the code becomes extremely slow, with huge memory allocations…

function update_step_noinput3!(N::Int64, xt::AbstractVector, W::AbstractMatrix, dt::Float64,
    auxinput::AbstractVector, meanx::AbstractVector, meanx2::AbstractVector)

    @inbounds @simd for i=1:N
        inpt = 0.0 
        @simd for j=1:N
            inpt += w[j,i]*xt[j]
        end
        xt[i] += dt * (-xt[i] + sigmoid(inpt))
        meanx[i] += xt[i]
        meanx2[i] += xt[i]^2
    end
end

The biggest problem here seems to be that I am missing something important about how Julia handles things :) But after many tests I am not able to grasp what the problem is. Any help? Why is the second version slower than the first one? And where are the memory allocations coming from in the third example?

It would help if you provided exact scripts that we can copy and paste to run and see what you’re seeing.

Sure, here you have the functions above in a handy Julia script

network_code.jl (2.6 KB)

(EDIT: I had to reupload because I realised I had copied slightly old code; it now matches the code in the question.)

I executed them with this code

using BenchmarkTools

N = 300
w = randn(N,N)
w2 = deepcopy(w') # materialized transpose, for the loop-based version
x = rand(N)
auxinput = zeros(N)
meanx = zeros(N)
meanx2 = zeros(N)

#Execute these one by one to see the result in REPL
@benchmark simulate_network(N, 0, 1000, x, w, 0.2, auxinput, meanx, meanx2)
@benchmark simulate_network2(N, 0, 1000, x, w, 0.2, auxinput, meanx, meanx2)
@benchmark simulate_network3(N, 0, 1000, x, w2, 0.2, auxinput, meanx, meanx2)

Hope this helps you run your own tests if needed. Thank you so much in advance!

Unless your matrix is triangular, this is not equivalent to your original code: you update xt[1] at i=1, and then use that already-updated value at i=2, j=1 when computing inpt += w[1,2]*xt[1].
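
A minimal sketch of how to keep them equivalent (assuming a second buffer, here called old_xt, holding the state at the start of the step, the sigmoid defined above, and the same transposed-matrix indexing as your third version):

function update_step_buffered!(old_xt, xt, w, dt, meanx, meanx2)
    @inbounds for i in eachindex(xt)
        inpt = 0.0
        @simd for j in eachindex(old_xt)
            inpt += w[j,i] * old_xt[j] # read only the pre-update state
        end
        xt[i] = old_xt[i] + dt * (-old_xt[i] + sigmoid(inpt))
        meanx[i]  += xt[i]
        meanx2[i] += xt[i]^2
    end
    # the caller swaps the two buffers (not just local names) between steps
end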

The problem with the third option is that you have a typo here:

function update_step_noinput3!(N::Int64, xt, W::AbstractMatrix, dt::Float64,

where W is upper case, thus the lower-case w used inside the function is a global variable.
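
That is, the inner accumulation should read

inpt += W[j,i]*xt[j] # the argument W, not the global w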

ps: I found that because, with some experience, one can see that the function should be non-allocating, so something like a typo had to be there.

But this is the kind of thing the VS Code linter warns you about:

(W and auxinput are not used within the function)

You are both totally correct. First of all, I should swap the old state with the new one, otherwise the results will be incorrect. But this makes the code even slightly slower, so it still does not explain why the second option is slower when it does N fewer iterations.

The extremely slow time of the third option comes from the typo. However, once this is fixed it still takes 17 ms on my machine, which is noticeably slower than the other options.

Such sloppy errors… I also had to edit my post quite a bit, because I did a terrible job of re-organizing my code for the question. Apologies, I have maybe spent too much time looking at this and I guess I’m very tired.

If you give me some more time I will update the post with corrected code. In any case, I am quite sure that I am missing something fundamental.

Edit:
New version of the code with the swapping added. Timing did not change much.
network_code.jl (2.8 KB)

N = 300;
w = randn(N,N);
w2 = deepcopy(w');
x = zeros(N);
oldx = rand(N);
auxinput = zeros(N);
meanx = zeros(N);
meanx2 = zeros(N); 

@benchmark simulate_network(N, 0, 1000, x, w, 0.2, auxinput, meanx,meanx2)
@benchmark simulate_network2(N, 0, 1000, oldx, x, w, 0.2, auxinput, meanx, meanx2)
@benchmark simulate_network3(N, 0, 1000, oldx, x, w2, 0.2, auxinput, meanx, meanx2)

ps: you don’t need to keep passing N around. You can use length(x) to get the dimension of the array, or even better eachindex(x) or axes, which guarantee that you are iterating over the correct indices. For instance, this would be more natural in Julia for your third function:

function update_step_noinput3!(xt, w, dt, auxinput, meanx, meanx2)
    for i in axes(w, 2)
        inpt = zero(eltype(w))
        for j in axes(w, 1)
            inpt += w[j,i]*xt[j]
        end
        xt[i] += dt * (-xt[i] + sigmoid(inpt))
        meanx[i] += xt[i]
        meanx2[i] += xt[i]^2
    end
end

Thanks for the info. I’m aware of it, but I’m not quite used to the syntax yet (coming from C++) and I strongly prefer having my indices explicit. I guess I just need more time with the language to get comfortable with them.

Still, the main question of why options 2 and 3 are slower (especially 3), despite doing far fewer iterations, is still open.

For that I suggest, again, providing exactly reproducible code showing what you’re seeing, with the updated versions.

edit: I see you did update the code.

So my guess: the implementation of the matrix-vector product in mul! is really fast, and when you put it inside the loop that also computes the means, you prevent proper vectorization of the operations.

For instance, this recovers the performance:

using LoopVectorization
function update_step3!(N::Int64, old_xt::AbstractVector, xt::AbstractVector, W::AbstractMatrix, dt::Float64,
    meanx::AbstractVector, meanx2::AbstractVector, auxinput)
    auxinput .= 0.0
    @turbo for i in 1:N
        for j in 1:N
            auxinput[i] += W[j,i]*old_xt[j] 
        end
    end
    for i=1:N
        xt[i] += dt * (-xt[i] + sigmoid(auxinput[i]))
        meanx[i] += xt[i]
        meanx2[i] += xt[i]^2
    end
    old_xt, xt = xt, old_xt # careful: this only swaps the local bindings; the caller must alternate the two buffers between steps
end

The script with the code I am using right now, with the corrections from your comments included, should be here. Let me know if there’s any problem with it.


See my comment above.

I am unfamiliar with LoopVectorization.jl, but I guess it’s impossible to use the @turbo macro in front of the whole loop, isn’t it? I don’t understand why the means break the vectorization of the operations.

Also, even assuming mul! has some kind of nice magic inside it, what happens between versions 1 and 2? In the first one, I use the @. operator and then compute the means separately, so I should be looping over N elements twice. The first loop could get a speedup from vectorization, but I find it hard to believe that this is enough to beat a single loop over N elements where everything is done at once.

Then there’s no way to do better than version 1?

Thank you so much!

“Vectorization” in this context means using the ability of the processor to perform multiple similar operations simultaneously (see Advanced Vector Extensions - Wikipedia). This is what you are asking for when you put the @simd flag in front of a loop.

All relevant linear algebra routines use that extensively, and thus mul! is very fast. LoopVectorization does the same and makes the loop faster as well. But those require some regularity in the set of instructions performed, so if the computations get complicated you may no longer be able to use that capability of the processor.

Then, it is sometimes better to split the loop into vectorizable parts, as is the case here.
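
As a toy illustration (the names are made up for the example, it uses the sigmoid from above, and the exact gap is machine-dependent): in the second version the pure accumulations get a SIMD loop of their own, while the exp call in the mixed loop limits what can be vectorized.

using BenchmarkTools

# everything in one loop: the exp inside sigmoid limits SIMD
function mixed!(x, m, m2, a, dt)
    @inbounds @simd for i in eachindex(x)
        x[i]  += dt * (-x[i] + sigmoid(a[i]))
        m[i]  += x[i]
        m2[i] += x[i]^2
    end
end

# the same work split so the accumulations form their own vectorizable loop
function split!(x, m, m2, a, dt)
    @inbounds @simd for i in eachindex(x)
        x[i] += dt * (-x[i] + sigmoid(a[i]))
    end
    @inbounds @simd for i in eachindex(x)
        m[i]  += x[i]
        m2[i] += x[i]^2
    end
end

# compare with, e.g.:
# x = rand(300); a = randn(300); m = zeros(300); m2 = zeros(300)
# @btime mixed!($x, $m, $m2, $a, 0.2)
# @btime split!($x, $m, $m2, $a, 0.2)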

For N = 300 I don’t know of anything else you can do. Note that the call to mul! is multi-threaded, and you could emulate that with @tturbo (with two ts), but for that size it does not appear to make a difference.
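
For reference, the threaded variant is just a macro swap in the product loop (a sketch reusing the same buffers as above, which, as said, did not make a difference at this size):

using LoopVectorization

function matvec_tturbo!(auxinput, W, old_xt)
    auxinput .= 0.0
    @tturbo for i in axes(W, 2) # @tturbo = multi-threaded @turbo
        for j in axes(W, 1)
            auxinput[i] += W[j,i] * old_xt[j]
        end
    end
end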

On the other hand, if you have larger N, for instance N = 3000, parallelization or using GPUs may help. Your version 1 is simple enough that using a GPU basically involves wrapping the arrays:

Running this:

using LinearAlgebra, Statistics # mul! and mean

sigmoid(x) = inv(one(x) + exp(-x))
function update_step!(x, w, dt, auxinput)
    mul!(auxinput, w, x) #auxinput = W*xt 
    @. x += dt * (-x + sigmoid(auxinput))
end
function simulate_network(simulation_steps, x, w, dt, auxinput, auxmean, auxvar)
    for t=1:simulation_steps
        #update_step_noinput2!(x, w, dt, auxinput)
        update_step!(x, w, dt, auxinput)
        auxmean .+= x
        auxvar .+= x .^2
    end
    return mean( auxvar ./ simulation_steps .- (auxmean ./ simulation_steps) .^2 )
end

I get, for N = 3000:

julia> using CUDA

julia> @btime simulate_network(1000, $x, $w, 0.2, $auxinput, $meanx, $meanx2)
  4.302 s (4 allocations: 23.50 KiB)
-23.77517260930618

# now using the GPU: use CuArrays
julia> @btime simulate_network(1000, $(CuArray(x)), $(CuArray(w)), 0.2, $(CuArray(auxinput)), $(CuArray(meanx)), $(CuArray(meanx2)))
  1.666 s (96133 allocations: 6.93 MiB)
-131.2218251491212

(be careful with the initialization of the arrays, because even auxiliary vectors are being mutated and affecting the results)
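
One way to guard against that is BenchmarkTools’ setup keyword, which rebuilds fresh arrays for every sample (a sketch; the names xs, ai, m1, m2 are just for the example, and evals=1 prevents reuse within a sample):

using BenchmarkTools
@benchmark simulate_network(1000, xs, $w, 0.2, ai, m1, m2) setup=(xs = copy($x); ai = zeros($N); m1 = zeros($N); m2 = zeros($N)) evals=1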

Makes sense. Thank you very much for your patience.

It seems I was underestimating the amount of speedup one can get from vectorization. Writing separate, vectorizable loops is then a much better option than putting everything into a single, non-vectorizable one.

If you have any recommended material on how to write vectorizable code in Julia, please share. Thanks!!

I don’t. Mostly, I don’t think that is special to Julia: the same code patterns will in general allow the compiler to vectorize loops in different languages. The LoopVectorization package helps to get very efficient vectorization when it can convert the loop to a suitable form, but many times just keeping the loop “simple” (whatever that means) and adding @inbounds and @simd is enough.
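
For what it’s worth, the pattern that vectorizes most reliably is a plain loop over arrays with no branches and no opaque function calls, e.g. (a generic sketch, nothing specific to this thread):

function dot_simple(a::Vector{Float64}, b::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i] # pure multiply-add, ideal for SIMD
    end
    return s
end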