How can I make this piece of code faster? By optimising it, or by using multiple threads?

I tried what you suggested. I get the following result:

 6.819 s (121276468 allocations: 3.79 GiB)

Allocating this [1:1:N] outside of the loop pretty much can't be more costly overall… It's a fundamental performance property of Julia, C, and basically every programming language: allocating memory is slower than accessing memory.
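A minimal sketch of what I mean, assuming N, n, A.W and P are as in your posted code: build the index vector once before the loop and reuse it, instead of re-collecting it on every iteration.

nodes = collect(1:N)   # node indices, allocated once outside the loop
for t in 1:n
    upagents = sample(nodes, Weights(A.W), P, replace=false)
    # ... rest of the loop body unchanged
end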

I tried using filter!(x->!(x in inneighbors(A.G,i)),cedg), but it doesn’t seem to improve significantly.

@carstenbauer, @Karajan, @Raf:

  1. I realise that most of the computations yield results that are of type Float64. Is there a way to ensure that by default all the inputs and outputs are in either Int32 or Float32 format? Would this help save time?
  2. Is Int64[] or Float64[] the best way to initialise an array whose size varies in every iteration?
  1. I doubt that is where you can gain performance. That's a highly specialized optimization, which most likely will not help at all in this case.

  2. Int64[] is best for initializing integer arrays, and Float64[] is best for initializing float arrays :wink:

Perhaps you can update your posted code? You have replied that you have changed several things, but I cannot see that you are updating the code itself.


Yes.

No.

Well, that's not quite true. Potentially it can, but until you have managed to remove all allocations within your loops it would at best be of marginal benefit.

The key at this point is to understand your algorithm. What do you actually need to compute and what do you actually need to store? And even more importantly, what do you have to recompute within each loop and what can be precomputed outside the loop?

To take a concrete example that has been highlighted as expensive in the profiling: what do you need the setdiff computation for? You clearly don't need it long term, since it's discarded fairly quickly. What it's used for is to iterate over the nodes in the graph that are not inneighbors of the current node. Maybe you can just loop over all nodes and query the graph in some other way to check whether there's an edge between the nodes in question? Depending on how the graph is stored, this may or may not be more efficient. Or you can use a temporary array, allocated only once, to mark the inneighbors and refer to that in a loop over all nodes, as in this modification (I also de-aliased the uses of edg but otherwise left things as they were):

function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),Float64[],Any[])
    rand!(Normal(μ,σ),A.θ)
    A.C.=round.(rand!(Uniform(Cmin,Cmax),A.C),digits=2)  # assign the rounded values back to A.C
    A.W=1.0.-A.C
    A.W/=sum(A.W)
    current_neighbors = falses(N)
    @inbounds for t in 1:n
        push!(A.θₜ,A.θ)
        upagents=Int64[]
        upagents=sample([1:1:N;],Weights(A.W),P,replace=false)
        @inbounds for i in upagents
            current_neighbors .= false
            for j in inneighbors(A.G, i)
                current_neighbors[j] = true
            end
            current_neighbors[i] = true
            
            edg = Int[]
            for j = 1:N
                if !current_neighbors[j] && rand() < Pᵢ
                    push!(edg, j)
                end
            end

            cedg=exp.(-abs.(A.θₜ[t][i].-A.θₜ[t][edg])/(1.0-A.C[i]))*(1.0-A.C[i])
            cedg/=sum(cedg)
            s=0.0
            c=0.0
            @inbounds for k in eachindex(cedg)
                s+=cedg[k]*sin(A.θₜ[t][edg[k]])
                c+=cedg[k]*cos(A.θₜ[t][edg[k]])
            end
            s+=A.C[i]*sin(A.θₜ[t][i])
            c+=A.C[i]*cos(A.θₜ[t][i])
            A.θ[i] = atan(s,c)
        end
        edg2=round.(A.θ,digits=2)
        if(all(y->y.==edg2[1],edg2)==true)
            break
        end
    end
    return A.θₜ
end

The next thing you need, and maybe you already have this, is a way to verify that the implementation is doing the right thing. My changes speed things up but maybe I made a mistake and it computes the wrong result faster. That you really need to know.

For more gains, think about this: do you really need to store edg and cedg at all, or can you make the necessary computations on the fly?
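A rough, untested sketch of what that could look like for the loop body above, assuming the same variables (A, current_neighbors, Pᵢ, t, i) as in the function: since s and c are linear in the weights, you can accumulate the unnormalized sums together with the total weight and divide once at the end, never materializing edg or cedg.

s = 0.0; c = 0.0; wsum = 0.0
@inbounds for j = 1:N
    if !current_neighbors[j] && rand() < Pᵢ
        # same weight as before; the (1 - C[i]) factor is dropped because
        # it cancels in the normalization anyway
        w = exp(-abs(A.θₜ[t][i] - A.θₜ[t][j]) / (1.0 - A.C[i]))
        s += w * sin(A.θₜ[t][j])
        c += w * cos(A.θₜ[t][j])
        wsum += w
    end
end
s = s / wsum + A.C[i] * sin(A.θₜ[t][i])
c = c / wsum + A.C[i] * cos(A.θₜ[t][i])
A.θ[i] = atan(s, c)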


You mean Int[], so it doesn't necessarily fail on Win32?

I considered that, but concluded that @ennvvy specifically wanted Int64.

@carstenbauer, @Karajan, @Raf, @DNF: Thank you so much for the leads. I have now managed to get the following performance for the same parametric values as earlier:

using Distributions,Distributed,StatsBase,LinearAlgebra,LightGraphs,BenchmarkTools,Random

mutable struct Arr
    t::Int64
    agents::Array{Int64,1}
    θ::Array{Float64,1}
    C::Array{Float64,1}
    G::SimpleDiGraph{Int64}
    W::Array{Float64,1}
end

function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(0,[1:N;],Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
    Array{Float64}(undef,N))
    rand!(Normal(μ,σ),A.θ)
    rand!(Uniform(Cmin,Cmax),A.C)
    A.W.=1.0.-A.C
    A.W./=sum(A.W)
    θₜ=Vector{Float64}[]
    upagents=Array{Int64}[]
    cedg=Array{Float64}(undef,N)
    while(iszero(1 .-isapprox.(A.θ[1],A.θ,atol=1e-2))==false)
        A.t+=1
        push!(θₜ,copy(A.θ))
        upagents=sample(A.agents,Weights(A.W),P,replace=false)
        @inbounds for i in upagents
            cedg.=0.0
            @inbounds for j in inneighbors(A.G, i) 
              if(rand()<Pᵢ && i!=j)
                cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
              end
            end
            cedg./=(sum(cedg)/(1.0-A.C[i]))
            cedg[i]=A.C[i]
            s=0.0
            c=0.0
            @inbounds for k in 1:N
              s+=cedg[k]*sin(θₜ[A.t][k])
              c+=cedg[k]*cos(θₜ[A.t][k])
            end
            A.θ[i]=atan(s,c)
        end
    end
    return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C),mean(A.C)
end
Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=250;Pᵣ=0.0;Pᵢ=0.5;P=500;μ=pi/2;σ=pi/18;
@btime main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ)

Yields:

2.674 s (14911 allocations: 9.13 MiB)

There’s also a function sincos which is faster than sin and cos individually. Not sure if it’s a bottleneck here though.
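A minimal sketch of how it could slot into the inner loop of the posted code (keeping the cedg[k] weights):

si, ci = sincos(θₜ[A.t][k])   # one call instead of separate sin and cos
s += cedg[k] * si
c += cedg[k] * ci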


If sin and cos are bottlenecks there should be bigger gains from the observation that, except for the first iteration, their arguments are the result of an atan computation. It would take some extra bookkeeping and complexity to take advantage of it though.
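To sketch the bookkeeping idea (sinθ and cosθ are hypothetical extra arrays of length N, initialized from the starting θ and updated wherever A.θ[i] is assigned): since θ[i] is produced as atan(s, c), its sine and cosine can be recovered from s and c directly and cached, instead of being recomputed later.

r = hypot(s, c)          # = sqrt(s^2 + c^2)
A.θ[i] = atan(s, c)
sinθ[i] = s / r          # sin(atan(s, c)) == s / hypot(s, c)
cosθ[i] = c / r          # cos(atan(s, c)) == c / hypot(s, c)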

But this looks like very low-hanging fruit. No need to compute sin and cos just to multiply them by zero:

            @inbounds for k in 1:N
                if cedg[k] != 0
                    s+=cedg[k]*sin(θₜ[A.t][k])
                    c+=cedg[k]*cos(θₜ[A.t][k])
                end
            end

1 .- isapprox.(A.θ[1],A.θ,atol=1e-2)

is clearly a source of allocations.
I’d formulate the loop condition as

while any(abs(x - A.θ[1]) > 0.01 for x in A.θ)

Also,

upagents = sample(...)

is a candidate for optimization. It appears that you want

sample!(A.agents, Weights(A.W), upagents, replace=false)

@GunnarFarneback: Your suggestion significantly lowered the execution time, down to the order of ms.
@Vasily_Pisarev: Your suggestions did lower the allocations.

The revised code:

using Distributions,Distributed,StatsBase,LinearAlgebra,LightGraphs,BenchmarkTools,Random,Plots

mutable struct Arr
    t::Int64
    agents::Array{Int64,1}
    θ::Array{Float64,1}
    C::Array{Float64,1}
    G::SimpleDiGraph{Int64}
    W::Array{Float64,1}
end

function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(0,[1:N;],Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
    Array{Float64}(undef,N))
    rand!(Normal(μ,σ),A.θ)
    rand!(Uniform(Cmin,Cmax),A.C)
    A.W.=1.0.-A.C
    A.W./=sum(A.W)
    θₜ=Vector{Float64}[]
    upagents=Array{Int64}(undef,P)  # pre-allocated output for sample!
    cedg=Array{Float64}(undef,N)
    while(any(abs(x - A.θ[1]) > 0.01 for x in A.θ))
        A.t+=1
        push!(θₜ,copy(A.θ))
        sample!(Random.GLOBAL_RNG,A.agents, Weights(A.W), upagents, replace=false)
        @inbounds for i in upagents
            cedg.=0.0
            @inbounds for j in inneighbors(A.G, i) 
              if(rand()<Pᵢ && i!=j)
                cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
              end
            end
            cedg./=(sum(cedg)/(1.0-A.C[i]))
            cedg[i]=A.C[i]
            s=0.0
            c=0.0
            @inbounds for k in 1:N
              if(cedg[k]!=0)
                s+=cedg[k]*sin(θₜ[A.t][k])
                c+=cedg[k]*cos(θₜ[A.t][k])
              end
            end
            A.θ[i]=atan(s,c)
        end
    end
    #return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C),mean(A.C)
    return θₜ
end
Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=250;Pᵣ=0.0;Pᵢ=0.5;P=500;μ=pi/2;σ=pi/18;
@btime main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ)
589.232 ms (14481 allocations: 6.67 MiB)

To simulate network sizes larger than 10,000, I am planning to tweak this code (the inner for loops) to run on a GPU. Is that the right approach? Using the above code, I get the following performance:

Cmin=0.5;Cmax=0.9;N=10000;n=5000;outdeg=2500;Pᵣ=0.0;Pᵢ=0.5;P=5000;μ=pi/2;σ=pi/18;
@btime main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ)
68.707 s (220977 allocations: 663.53 MiB)

With this modification

                if(cedg[k]!=0)
                    si,ci = sincos(θₜ[A.t][k])
                    s+=si
                    c+=ci
                end

I get it to run significantly faster

  452.775 ms (14568 allocations: 7.12 MiB) # Before
  54.714 ms (14106 allocations: 4.74 MiB) # After

It's actually a bit strange that the runtime is cut by more than half by this. Maybe sincos allows further optimizations to be made, like SIMDing the loop or something.


That's really surprising. When I tried using sincos there wasn't any obvious difference in time or allocations. This is what I got:

672.970 ms (14515 allocations: 6.85 MiB)

It seems your timing of 500+ ms is a bit higher than what I see. Maybe you just have a different processor that does not have the same instruction set as the one I use?

I am currently running it on google colab. It has an Intel(R) Xeon(R) CPU @ 2.30GHz with a 12GB RAM

Ah, then I would not be surprised if the full potential of the processor cannot be used. Have you tried running the example locally, on a reasonably modern CPU?


I tried it on a 2.4 GHz Dual-Core Intel Core i5. It still shows similar performance.

667.504 ms (14488 allocations: 6.71 MiB)

Inside the for loop, when the si and ci values are added to s and c, they are multiplied by the weights cedg[k] in the original code but not in this modification. Could that make the difference?

It seems unlikely that that loop could get any automatic SIMD optimizations and there shouldn’t be any allocations to gain from your change. I would guess that the randomness has resulted in different iteration counts for your two runs.

Btw, it's extremely hard to do fair benchmarking with the randomness and the loop-until-convergence termination. Even with a fixed seed, @btime will vary the number of runs if you do something that significantly affects the runtime, which can give entirely different numbers.
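One way to make the comparison a bit more controlled (just a sketch, assuming the main function and parameters defined above) is to re-seed the RNG before every evaluation and force one evaluation per sample, so each timed run at least starts from the same random state:

using BenchmarkTools, Random

# Re-seed the global RNG before each evaluation; evals=1 so the seed
# applies to every timed run rather than only the first in a batch.
@benchmark main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ) setup=(Random.seed!(1234)) evals=1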