I tried what you suggested. I get the following result
6.819 s (121276468 allocations: 3.79 GiB)
Allocating this [1:1:N] outside of the loop pretty much can’t be more costly overall. It’s a fundamental performance characteristic of Julia, C, and basically every other programming language: allocating memory is slower than accessing memory.
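For illustration, a minimal sketch of the point (placeholder values stand in for the posted code’s N, P and weights W; only the hoisting of the index vector matters here):

using StatsBase

N, n, P = 1000, 10, 500
W = rand(N); W ./= sum(W)
agents = collect(1:N)                          # allocated once, outside the loop
for t in 1:n
    upagents = sample(agents, Weights(W), P, replace=false)
    # ... rest of the loop body works with upagents ...
end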
I tried using filter!(x->!(x in inneighbors(A.G,i)),cedg), but it doesn’t seem to improve significantly.
Would storing values in Int32 or Float32 format help save time? Is Int64[] or Float64[] the best way to initialise an array whose size varies in every iteration?
I doubt that is where you can gain performance. That’s a highly specialized optimization that most likely will not help at all in this case.
Int64[] is best for initializing integer arrays, and Float64[] is best for initializing float arrays.
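For concreteness, this is all that empty typed initialisation amounts to; sizehint! is an optional extra, not something from the posted code:

edg  = Int[]           # Int is Int64 on 64-bit systems
cedg = Float64[]
sizehint!(edg, 1000)   # optionally reserve capacity to cut down reallocations while push!-ing
push!(edg, 7)
push!(cedg, 0.25)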
Perhaps you can update your posted code? You have replied that you have changed several things, but I cannot see that you are updating the code itself.
Yes.
No.
Well, that’s not quite true. Potentially it can, but until you have managed to remove all allocations within your loops it would at best be of marginal benefit.
The key at this point is to understand your algorithm. What do you actually need to compute and what do you actually need to store? And even more importantly, what do you have to recompute within each loop and what can be precomputed outside the loop?
To take a concrete example, that has been highlighted as expensive in the profiling, what do you need the setdiff computation for? You clearly don’t need it long term since it’s discarded fairly quickly. What it’s used for is to iterate over the nodes in the graph that are not inneighbors to the current node. Maybe you can just loop over all nodes and query the graph in some other way whether there’s an edge between the nodes in question? Depending on how the graph is stored this may or may not be more efficient. Or you can use a temporary array, that you allocate only once, to mark the inneighbors and refer to that in a loop over all nodes, like in this modification (I also de-aliased the uses of edg
but otherwise left things as they were):
function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),Float64[],Any[])
    rand!(Normal(μ,σ),A.θ)
    A.C.=round.(rand!(Uniform(Cmin,Cmax),A.C),digits=2)   # assign the rounded values (the original discarded them)
    A.W=1.0.-A.C
    A.W/=sum(A.W)
    current_neighbors = falses(N)   # reusable marker array for the inneighbors of the current node
    @inbounds for t in 1:n
        push!(A.θₜ,A.θ)
        upagents=Int64[]
        upagents=sample([1:1:N;],Weights(A.W),P,replace=false)
        @inbounds for i in upagents
            current_neighbors .= false
            for j in inneighbors(A.G, i)
                current_neighbors[j] = true
            end
            current_neighbors[i] = true
            edg = Int[]   # non-neighbors of i, each kept with probability Pᵢ
            for j = 1:N
                if !current_neighbors[j] && rand() < Pᵢ
                    push!(edg, j)
                end
            end
            cedg=exp.(-abs.(A.θₜ[t][i].-A.θₜ[t][edg])/(1.0-A.C[i]))*(1.0-A.C[i])
            cedg/=sum(cedg)
            s=0.0
            c=0.0
            @inbounds for k in eachindex(cedg)
                s+=cedg[k]*sin(A.θₜ[t][edg[k]])
                c+=cedg[k]*cos(A.θₜ[t][edg[k]])
            end
            s+=A.C[i]*sin(A.θₜ[t][i])
            c+=A.C[i]*cos(A.θₜ[t][i])
            A.θ[i] = atan(s,c)
        end
        edg2=round.(A.θ,digits=2)
        if(all(y->y.==edg2[1],edg2)==true)   # stop once all phases agree to two decimals
            break
        end
    end
    return A.θₜ
end
The next thing you need, and maybe you already have this, is a way to verify that the implementation is doing the right thing. My changes speed things up but maybe I made a mistake and it computes the wrong result faster. That you really need to know.
For more gains, think about this: do you really need to store edg and cedg at all, or can you make the necessary computations on the fly?
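A hedged sketch of that idea, using illustrative names rather than the posted code (θ holds the phases at the current time step, C the confidence values, mask marks node i and its inneighbors): since s and c are linear in the weights, you can accumulate unnormalized sums and divide by the weight total at the end, so neither edg nor cedg is ever stored.

function update_phase(i, θ::Vector{Float64}, C::Vector{Float64}, mask::BitVector, Pᵢ::Float64)
    s = 0.0; c = 0.0; wsum = 0.0
    for j in eachindex(θ)
        if !mask[j] && rand() < Pᵢ
            w = exp(-abs(θ[i] - θ[j]) / (1.0 - C[i]))   # unnormalized weight
            wsum += w
            s += w * sin(θ[j])
            c += w * cos(θ[j])
        end
    end
    s = s / wsum + C[i] * sin(θ[i])   # normalize, then add the self term
    c = c / wsum + C[i] * cos(θ[i])
    return atan(s, c)
end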
You mean Int[], so it doesn’t necessarily fail on Win32?
I considered that, but concluded that @ennvvy specifically wanted Int64.
@carstenbauer,@Karajan,@Raf,@DNF: Thank you so much for the leads. I have now managed to get the following performance for the same parametric values as earlier:
using Distributions,Distributed,StatsBase,LinearAlgebra,LightGraphs,BenchmarkTools,Random

mutable struct Arr
    t::Int64
    agents::Array{Int64,1}
    θ::Array{Float64,1}
    C::Array{Float64,1}
    G::SimpleDiGraph{Int64}
    W::Array{Float64,1}
end

function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(0,[1:N;],Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
          Array{Float64}(undef,N))
    rand!(Normal(μ,σ),A.θ)
    rand!(Uniform(Cmin,Cmax),A.C)
    A.W.=1.0.-A.C
    A.W./=sum(A.W)
    θₜ=Vector{Float64}[]
    upagents=Array{Int64}[]
    cedg=Array{Float64}(undef,N)
    while(iszero(1 .-isapprox.(A.θ[1],A.θ,atol=1e-2))==false)
        A.t+=1
        push!(θₜ,copy(A.θ))
        upagents=sample(A.agents,Weights(A.W),P,replace=false)
        @inbounds for i in upagents
            cedg.=0.0
            @inbounds for j in inneighbors(A.G, i)
                if(rand()<Pᵢ && i!=j)
                    cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
                end
            end
            cedg./=(sum(cedg)/(1.0-A.C[i]))
            cedg[i]=A.C[i]
            s=0.0
            c=0.0
            @inbounds for k in 1:N
                s+=cedg[k]*sin(θₜ[A.t][k])
                c+=cedg[k]*cos(θₜ[A.t][k])
            end
            A.θ[i]=atan(s,c)
        end
    end
    return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C),mean(A.C)
end
Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=250;Pᵣ=0.0;Pᵢ=0.5;P=500;μ=pi/2;σ=pi/18;
@btime main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ)
Yields:
2.674 s (14911 allocations: 9.13 MiB)
There’s also a function sincos which is faster than calling sin and cos individually. Not sure if it’s a bottleneck here though.
If sin and cos are bottlenecks there should be bigger gains from the observation that, except for the first iteration, their arguments are the result of an atan computation. It would take some extra bookkeeping and complexity to take advantage of that though.
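A hedged sketch of the bookkeeping this would need (sinθ and cosθ are hypothetical cache vectors kept alongside the phases, not part of the posted code): since θ[i] is produced by atan(s, c), its sine and cosine can be recovered from s and c directly, without calling sin and cos again in the next iteration.

function store_phase!(θ, sinθ, cosθ, i, s, c)
    r = hypot(s, c)       # length of the vector (c, s)
    θ[i]    = atan(s, c)
    sinθ[i] = s / r       # == sin(θ[i])
    cosθ[i] = c / r       # == cos(θ[i])
    return θ[i]
end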
But this looks like very low-hanging fruit. No need to compute sin and cos just to multiply them by zero:
@inbounds for k in 1:N
    if cedg[k] != 0
        s+=cedg[k]*sin(θₜ[A.t][k])
        c+=cedg[k]*cos(θₜ[A.t][k])
    end
end
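Putting that zero check together with sincos, the accumulation could look like this hedged sketch (weighted_sums is an illustrative helper, not part of the posted code):

function weighted_sums(cedg::Vector{Float64}, θ::Vector{Float64})
    s = 0.0
    c = 0.0
    @inbounds for k in eachindex(cedg, θ)
        cedg[k] == 0 && continue    # skip the sin/cos work for zero weights
        sk, ck = sincos(θ[k])       # one call instead of separate sin and cos
        s += cedg[k] * sk
        c += cedg[k] * ck
    end
    return s, c
end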
1 .- isapprox.(A.θ[1],A.θ,atol=1e-2) is clearly a source of allocations. I’d formulate the loop condition as
while any(abs(x - A.θ[1]) > 0.01 for x in A.θ)
Also, upagents = sample(...) is a candidate for optimization. It appears that you want
sample!(A.agents, Weights(A.W), upagents, replace=false)
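For that call to work in place, upagents would need to be preallocated with length P outside the loop, along the lines of:

upagents = Vector{Int}(undef, P)                           # allocated once, before the loop
sample!(A.agents, Weights(A.W), upagents, replace=false)   # refills it in place every iteration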
@GunnarFarneback: Your suggestion significantly lowered the execution time to the order of ms
@Vasily_Pisarev:Your suggestions did lower the allocations
The revised code:
using Distributions,Distributed,StatsBase,LinearAlgebra,LightGraphs,BenchmarkTools,Random,Plots

mutable struct Arr
    t::Int64
    agents::Array{Int64,1}
    θ::Array{Float64,1}
    C::Array{Float64,1}
    G::SimpleDiGraph{Int64}
    W::Array{Float64,1}
end

function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(0,[1:N;],Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
          Array{Float64}(undef,N))
    rand!(Normal(μ,σ),A.θ)
    rand!(Uniform(Cmin,Cmax),A.C)
    A.W.=1.0.-A.C
    A.W./=sum(A.W)
    θₜ=Vector{Float64}[]
    upagents=Array{Int64}(undef,round(Int,0.5*N))
    cedg=Array{Float64}(undef,N)
    while(any(abs(x - A.θ[1]) > 0.01 for x in A.θ))
        A.t+=1
        push!(θₜ,copy(A.θ))
        sample!(Random.GLOBAL_RNG,A.agents, Weights(A.W), upagents, replace=false)
        @inbounds for i in upagents
            cedg.=0.0
            @inbounds for j in inneighbors(A.G, i)
                if(rand()<Pᵢ && i!=j)
                    cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
                end
            end
            cedg./=(sum(cedg)/(1.0-A.C[i]))
            cedg[i]=A.C[i]
            s=0.0
            c=0.0
            @inbounds for k in 1:N
                if(cedg[k]!=0)
                    s+=cedg[k]*sin(θₜ[A.t][k])
                    c+=cedg[k]*cos(θₜ[A.t][k])
                end
            end
            A.θ[i]=atan(s,c)
        end
    end
    #return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C),mean(A.C)
    return θₜ
end
Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=250;Pᵣ=0.0;Pᵢ=0.5;P=500;μ=pi/2;σ=pi/18;
@btime main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ)
589.232 ms (14481 allocations: 6.67 MiB)
To be able to simulate for network sizes larger than 10,000, I am planning to tweak this code (the inner for loops) to be able to run on GPU. Is that the right approach? Using the above code, I get the following performance:
Cmin=0.5;Cmax=0.9;N=10000;n=5000;outdeg=2500;Pᵣ=0.0;Pᵢ=0.5;P=5000;μ=pi/2;σ=pi/18;
@btime main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ)
68.707 s (220977 allocations: 663.53 MiB)
With this modification
if(cedg[k]!=0)
    si,ci = sincos(θₜ[A.t][k])
    s+=si
    c+=ci
end
I get it to run significantly faster
452.775 ms (14568 allocations: 7.12 MiB) # Before
54.714 ms (14106 allocations: 4.74 MiB) # After
It’s actually a bit strange that the runtime is cut by more than half by this. Maybe sincos allows further optimizations to be made, like SIMD-vectorizing the loop or something.
That’s really surprising. When I tried using sincos there wasn’t any obvious difference in time or allocations. This is what I got:
672.970 ms (14515 allocations: 6.85 MiB)
It seems your timing of 500+ ms is a bit higher than what I see; maybe you just have a different processor that does not have the same instruction set as the one I use?
I am currently running it on Google Colab. It has an Intel(R) Xeon(R) CPU @ 2.30GHz with 12 GB of RAM.
Ah, then I would not be surprised if the full potential of the processor can not be used. Have you tried running the example locally, on a reasonably modern CPU?
I tried it on a 2.4 GHz dual-core Intel Core i5. It still shows similar performance.
667.504 ms (14488 allocations: 6.71 MiB)
Inside the for loop, while adding the si and ci values to s and c, they are multiplied by the weights cedg[k] in the original code. Could that make the difference?
It seems unlikely that that loop could get any automatic SIMD optimizations and there shouldn’t be any allocations to gain from your change. I would guess that the randomness has resulted in different iteration counts for your two runs.
Btw, it’s extremely hard to do fair benchmarking with the randomness and the loop until convergence. Even with a fixed seed, @btime will vary the number of runs if you do something that significantly affects the runtime, which can give entirely different numbers.
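One way to make the comparison a bit fairer (assuming the parameter variables from the posted code are in scope) is to reseed the global RNG before every evaluation, so each sample sees the same random stream:

using BenchmarkTools, Random

@btime main($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ) setup=(Random.seed!(1234)) evals=1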