How to make this piece of code faster? optimising it or using multiple threads?

I have a piece of code involving a large matrix. I have been trying to optimise it. I have exhausted all my options, I would like to know of suggestions that experts on this forum might have.

using Distributions
using StatsBase
using LightGraphs
using LinearAlgebra

mutable struct Par
    g::SimpleDiGraph{Int64} ;
    μ::Int64 ; 
    n::Int64 ; 
    P::Int64 ; 
    ctime::Int64 ; 
    cstub::Float64 ; 
    crate::Int64 ; 
    cvalue::Float64 ; 
    expcval::Float64 ; 
    temp::Bool ;
end

mutable struct Arr
    θ::Array{Float64,1} ;
    C::Array{Float64,1} ; 
    wts::Array{Float64,1} ; 
    upagents::Array{Int64,1} ; 
    edg::Array{Int64,1} ; 
    cedg::Array{Int64,1} ; 
    temp::Array{Float64,1} ;
end

mutable struct Mat
    θₜ::Array{Float64,2} ; 
end

function main(Cmin::Float64,Cmax::Float64,N::Int64,outdeg::Int64,Pᵣ::Float64,σ::Int64,Pᵢ::Float64)
    p = Par(SimpleDiGraph(Int64),90,5000,Int64(N/2),0,0.0,0,0.0,0.0,false) ;
    a = Arr(Float64[],Float64[],Float64[],Int64[],Int64[],Int64[],Float64[]) ;
    m = Mat(Matrix{Float64}(undef,N,p.n)) ;
    p.g = watts_strogatz(N,outdeg,Páµ£,is_directed=true) ;
    a.θ = rand(Normal(p.μ,σ),N)*pi/180 ;
    a.C = rand(Uniform(Cmin,Cmax),N) ;
    a.wts = 1.0 .- a.C ;
    a.wts = a.wts/sum(a.wts) ;
    for t = 1:1:p.n
        m.θₜ[:,t] = a.θ ;
        a.upagents = sample([1:1:N;],Weights(a.wts),p.P,replace=false) ;
        for i ∈ a.upagents
            a.edg = inneighbors(p.g,i) ;
            a.cedg = setdiff([1:1:N;],a.edg) ;
            a.cedg = setdiff(vcat(a.edg,a.cedg[rand(length(a.cedg)).<Páµ¢]),i) ;
            a.temp = exp.(-abs.(m.θₜ[i,t] .-m.θₜ[a.cedg,t])/(1.0-a.C[i])) ;
            a.temp = vcat(a.C[i],(a.temp*(1-a.C[i]))/sum(a.temp)) ;
            a.cedg = vcat(i,a.cedg) ;
            a.θ[i] = atan(sum(a.temp.*sin.(m.θₜ[a.cedg,t])),sum(a.temp.*cos.(m.θₜ[a.cedg,t]))) ;
        end
        a.temp = round.(a.θ,digits=2) ;
        p.temp = all(y->y.==a.temp[1],a.temp) ;
        if(p.temp==true)
            p.ctime = t ;
            p.crate = 1 ;
            p.cvalue = a.θ[1] ;
            p.expcval = round(sum(a.C.*m.θₜ[:,1])/sum(a.C),digits=2) ;
            p.cstub = mean(a.C) ;
            break
        end
    end
    return p.ctime,p.crate,p.cvalue,p.expcval,p.cstub
end

For Cmin=0.5;Cmax=0.9;N=1000;outdeg=250;Pᵣ=0.0;σ=10;Pᵢ=0.5;, I get

@time main(Cmin,Cmax,N,outdeg,Pᵣ,σ,Pᵢ) 
32.085691 seconds (20.75 M allocations: 25.131 GiB, 15.87% gc time)
(316, 1, 1.5760757023650616, 1.57, 0.6931634326318095)

The allocations are in the for-loop. Try re-writing it so that the fields of a are modified in-place, not re-assigned to new arrays (making the Arr struct immutable might help).

What have your optimization efforts been spent on so far? When you profile the code, where does it show time being spent?

1 Like

I tried using @btime to benchmark each command step-by-step. The initialisation of variables and arrays is what takes considerable time (ms). I have ensured that the code is type-stable, despite that there is considerable gc that I am unable to get rid off.

Can I please know what it means to be modified in-place? If I make it immutable, it doesn’t let me add contents to the array.

I don’t think that’s what you want to get the step-by-step breakdown. Try using https://github.com/KristofferC/TimerOutputs.jl .

using TimerOutputs
const to = TimerOutput()

using Distributions
using StatsBase
using LightGraphs
using LinearAlgebra

mutable struct Par
    g::SimpleDiGraph{Int64} ;
    μ::Int64 ; 
    n::Int64 ; 
    P::Int64 ; 
    ctime::Int64 ; 
    cstub::Float64 ; 
    crate::Int64 ; 
    cvalue::Float64 ; 
    expcval::Float64 ; 
    temp::Bool ;
end

mutable struct Arr
    θ::Array{Float64,1} ;
    C::Array{Float64,1} ; 
    wts::Array{Float64,1} ; 
    upagents::Array{Int64,1} ; 
    edg::Array{Int64,1} ; 
    cedg::Array{Int64,1} ; 
    temp::Array{Float64,1} ;
end

mutable struct Mat
    θₜ::Array{Float64,2} ; 
end

function main(Cmin::Float64,Cmax::Float64,N::Int64,outdeg::Int64,Pᵣ::Float64,σ::Int64,Pᵢ::Float64)
    p = Par(SimpleDiGraph(Int64),90,5000,Int64(N/2),0,0.0,0,0.0,0.0,false) ;
    a = Arr(Float64[],Float64[],Float64[],Int64[],Int64[],Int64[],Float64[]) ;
    m = Mat(Matrix{Float64}(undef,N,p.n)) ;
    p.g = watts_strogatz(N,outdeg,Páµ£,is_directed=true) ;
    a.θ = rand(Normal(p.μ,σ),N)*pi/180 ;
    a.C = rand(Uniform(Cmin,Cmax),N) ;
    a.wts = 1.0 .- a.C ;
    a.wts = a.wts/sum(a.wts) ;
    for t = 1:1:p.n
        m.θₜ[:,t] = a.θ ;
        @timeit to "sample" a.upagents = sample([1:1:N;],Weights(a.wts),p.P,replace=false) ;
        @timeit to "inner" for i ∈ a.upagents
            @timeit to "inner 1" a.edg = inneighbors(p.g,i) ;
            @timeit to "inner 2" a.cedg = setdiff([1:1:N;],a.edg) ;
            @timeit to "inner 3" a.cedg = setdiff(vcat(a.edg,a.cedg[rand(length(a.cedg)).<Páµ¢]),i) ;
            @timeit to "inner 4" a.temp = exp.(-abs.(m.θₜ[i,t] .-m.θₜ[a.cedg,t])/(1.0-a.C[i])) ;
            @timeit to "inner 5" a.temp = vcat(a.C[i],(a.temp*(1-a.C[i]))/sum(a.temp)) ;
            @timeit to "inner 6" a.cedg = vcat(i,a.cedg) ;
            @timeit to "inner 7" a.θ[i] = atan(sum(a.temp.*sin.(m.θₜ[a.cedg,t])),sum(a.temp.*cos.(m.θₜ[a.cedg,t]))) ;
        end
        a.temp = round.(a.θ,digits=2) ;
        p.temp = all(y->y.==a.temp[1],a.temp) ;
        @timeit to "if" if(p.temp==true)
            p.ctime = t ;
            p.crate = 1 ;
            p.cvalue = a.θ[1] ;
            p.expcval = round(sum(a.C.*m.θₜ[:,1])/sum(a.C),digits=2) ;
            p.cstub = mean(a.C) ;
            break
        end
    end
    return p.ctime,p.crate,p.cvalue,p.expcval,p.cstub
end
 Cmin=0.5;Cmax=0.9;N=1000;outdeg=250;Pᵣ=0.0;σ=10;Pᵢ=0.5
@time main(Cmin,Cmax,N,outdeg,Pᵣ,σ,Pᵢ)

julia> to
 ────────────────────────────────────────────────────────────────────
                             Time                   Allocations      
                     ──────────────────────   ───────────────────────
  Tot / % measured:       22.5s / 86.7%           24.8GiB / 100%     

 Section     ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────
 inner          280    19.5s   100%  69.6ms   24.7GiB  100%   90.3MiB
   inner 2     140k    8.32s  42.6%  59.4μs   9.39GiB  38.0%  70.3KiB
   inner 3     140k    5.83s  29.8%  41.6μs   7.20GiB  29.1%  53.9KiB
   inner 7     140k    2.32s  11.9%  16.6μs   2.44GiB  9.87%  18.3KiB
   inner 4     140k    1.34s  6.87%  9.58μs   3.04GiB  12.3%  22.7KiB
   inner 5     140k    937ms  4.80%  6.69μs   1.92GiB  7.76%  14.4KiB
   inner 6     140k    447ms  2.29%  3.19μs    719MiB  2.84%  5.26KiB
   inner 1     140k   25.2ms  0.13%   180ns     0.00B  0.00%    0.00B
 sample         280   44.5ms  0.23%   159μs   5.46MiB  0.02%  20.0KiB
 if             280    105μs  0.00%   375ns   15.9KiB  0.00%    58.1B
 ────────────────────────────────────────────────────────────────────

It looks like most of your time is spent in the setdiff lines.

4 Likes

As an alternative to using TimperOutputs, which measures actual runtime, you can use Julias built-in statistical Profiler to find the performance bottlenecks in your code. This has the advantage that you do not need to modify your code - you don’t have to introduce @timeit statements everywhere. In Juno, for example, you only have to replace your @time measurement by

using Profile
Profile.clear() # to be safe
Juno.@profiler main(Cmin, Cmax, N, outdeg, Pᵣ, σ, Pᵢ)

to get the following:

You can manually inspect the flame chart on the right to find your bottlenecks. Even greater though is that Juno highlights the bottlenecks for you in your code! See that the setdiff lines on the left are highlighted by bars.

Some useful links:
Julia documentation on Profiling
Juno documentation on Profiling
ProfileView.jl (if you don’t want to use Juno)

11 Likes

What people are worried about talking about modifying in place is this (20.75 M allocations: 25.131 GiB, 15.87% gc time). That are a lot of allocations.

Take this example:

julia> A = rand(5000);  

julia> B = rand(5000);

julia> C = similar(A);

julia> @btime $C = $A .* $B;
  4.471 μs (2 allocations: 39.14 KiB)

julia> @btime $C .= $A .* $B;  # <-- notice the extra dot
  2.232 μs (0 allocations: 0 bytes)  # <-- and notice that we don't have any allocations here

In the first version A * B gets calculated and the result gets stored to a newly created array. C is then bound to that array. But allocating a new array takes time. In the second version the already existing array C is used to store the results immediately – without creating a temporary array, which makes it faster. This is called modifying in place.

Now, it’s not always as easy as simply putting a dot in there (on why this works in this case read this) but every time you allocate arrays in a loop make sure if you really need to do it this way or if you could have reused an existing array.
Avoiding unnecessary allocations also includes stuff like

julia> @btime sum($A .* $B)
  5.818 μs (2 allocations: 39.14 KiB)  # <-- this allocates
1240.3540535219686                     # but you never even get to see an array!

By the way, I sometimes used a semicolon to suppress output here but I think none of the semicolon in your code are needed (except here: Cmin=0.5;Cmax=0.9;N=1000;outdeg=250;Pᵣ=0.0;σ=10;Pᵢ=0.5; (except the last one))

2 Likes

Thank you. I tried declaring some arrays a.θ,a.C,a.wts to be of fixed type and size zeros(Float64,N) and then,
`a.θ.=rand(Normal(p.μ,σ),N)*pi/180;

a.C. = rand(Uniform(Cmin,Cmax),N) ;

a.wts .= 1.0 .- a.C ;

a.wts. = a.wts/sum(a.wts) ;`

I couldn’t do this for the rest of the arrays since their size changes in every iteration. But this deteriorates the performance. Is there something wrong?

I think you need

using Random : rand!
rand!(Normal(p.μ,σ), a.θ) # fills a.θ in-place
a.θ .= deg2rad.(a.θ)

rand!(Uniform(Cmin,Cmax), a.C)

a.wts .= 1.0 .- a.C
a.wts. /= sum(a.wts)

In your code above, only a.wts .= 1.0 .- a.C doesn’t allocate.
a.θ.=rand(Normal(p.μ,σ),N)*pi/180 - allocates because rand() allocates and then rand() * pi allocates once more, and then prev / 180 allocates again, and finally you’re copying the result into a.θ (binding, as you’ve done before, indeed was cheaper).
a.C. = rand(Uniform(Cmin,Cmax),N) allocates because rand() allocates.
a.wts. = a.wts/sum(a.wts) because the division without dot-syntax allocates a new array for the result. Use a.wts .= a.wts ./ sum(a.wts) (note the dot before division) or a.wts ./= sum(a.wts) (my personal preference).

For the arrays that have to change the size, there are similarly destructive functions - e.g. setdiff!().

1 Like

The easiest gains are narray = [1:1:N;] then use it in the loop.

Then you can write some things in for loops to remove temporary arrays:

@timeit to "inner 7" a.θ[i] = begin
    s = 0.0
    c = 0.0
    for (i, x) in enumerate(m.θₜ[a.cedg, t])
        s += a.temp[i] * sin(x)
        c += a.temp[i] * cos(x)
    end
    atan(s, c)
end

But after that you pretty much need to rethink the algorithm. That’s where most of your speedup will come from. sum, setdiff and vcat wont compete with manually doing more of the process - with your own code in loops. You want to reduce the number of allocations, then array reads and writes as much as possible, but a bunch of those functions you are calling are built around doing those things.

1 Like

Have you tried using @inbound for the loops? I have found this results in significant performance improvement for some of my Julia code.

Thank you for all your suggestions. I have tried them:

using Distributions
using StatsBase
using LightGraphs
using LinearAlgebra
using Random
using BenchmarkTools

mutable struct Arr
    θ::Array{Float64,1}
    C::Array{Float64,1}
    G::SimpleDiGraph{Int64}
    W::Array{Float64,1}
    θₜ::Array{Any,1}
end

function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Páµ£,is_directed=true),Float64[],Any[])
    rand!(Normal(μ,σ),A.θ)
    round.(rand!(Uniform(Cmin,Cmax),A.C),digits=2)
    A.W=1.0.-A.C
    A.W/=sum(A.W)
    @inbounds for t in 1:n
        push!(A.θₜ,A.θ)
        upagents=Int64[]
        upagents=sample([1:1:N;],Weights(A.W),P,replace=false)
        @inbounds for i in upagents
            cedg=Any[]
            cedg=setdiff([1:1:N;],inneighbors(A.G,i))
            edg=Any[]
            @inbounds for j in 1:length(cedg)
                if(rand()<Páµ¢&&cedg[j]!=i)
                    push!(edg,cedg[j])
                end
            end
            cedg=exp.(-abs.(A.θₜ[t][i].-A.θₜ[t][edg])/(1.0-A.C[i]))*(1.0-A.C[i])
            cedg/=sum(cedg)
            s=0.0
            c=0.0
            @inbounds for k in eachindex(cedg)
                s+=cedg[k]*sin(A.θₜ[t][edg[k]])
                c+=cedg[k]*cos(A.θₜ[t][edg[k]])
            end
            s+=A.C[i]*sin(A.θₜ[t][i])
            c+=A.C[i]*cos(A.θₜ[t][i])
            A.θ[i] = atan(s,c)
        end
        edg=round.(A.θ,digits=2)
        if(all(y->y.==edg[1],edg)==true)
            break
        end
    end
    return A.θₜ
end

But, for Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=250;Pᵣ=0.0;Pᵢ=0.5;P=500;μ=pi/2;σ=pi/18;,

@btime otpt=main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ)

I still get a lot of allocations

15.750 s (220359025 allocations: 6.87 GiB)

Is there something more that I could do?

1 Like

Only quickly looking over your code, there are a couple of things that immediately jump in my eyes:

  • Why all those explicit Any[] arrays?
  • The [1:1:N;] objects in the sample calls allocate in each loop. You should preallocate them once. Also, maybe simply 1:N would also be fine?
  • Instead of creating empty arrays and pushing to them it is often more efficient to preallocate the necessary memory upfront and writing to specific slots of the array in the loops.
  • cedg/=sum(cedg) this allocates since you aren’t overwriting the entries of the array in-place.
  • edg=round.(A.θ,digits=2). First, this also allocates. Second, why round (and store) the values explicitly only for using it in a comparison. Instead you should be able to use isapprox with the correct atol set per keyword argument.

Hope that helps.

1 Like

Thank you for your time.

  1. I am trying to reuse those arrays (edg, cedg ). In the first instance cedg is used to store integers and in the second it is used to store floating-point numbers. And so is edg.
  2. Will try that
  3. I am not sure if I am sounding ignorant. But I have used the syntaxes after computing the memory and the time they take to execute (using @btime). I noticed that allocating memory upfront proves to be more costly time-wise.
  4. Will try
  5. I am trying to check if all entries of the array are similar. I learnt that isapprox can be used to compare two arrays or numbers, but not the entries of an array.

There’s something wrong here. round does not work in-place, and you’re not capturing the output.

3 Likes

You’re right, I have fixed that.

I have another question regarding the push! function used in the outermost for loop. The intention is to store value of A.θ at current time t. However, upon return A.θₜ all the entries in A.θₜ are updated with the value of A.θ in the final iteration.

I am trying to reuse those arrays ( edg, cedg ). In the first instance cedg is used to store integers and in the second it is used to store floating-point numbers. And so is edg .

Don’t do that. If the arrays are typed as Any, each element will just be a pointer to the actual element, so you aren’t reusing the space anyways. Plus, there will be type stability issues which could slow things down a lot.

How about floats only?

oops wrong reply. @ennvvy

You didn’t add my suggestions… This is allocating an identical 1000-integer array inside the loop, it should be the first thing to move out:

[1:1:N;]

I also profiled your code. The loop instead of the boradcast reduces a lot of allocations, and you can repeat that pattern in other places. But your bottleneck is setdiff.