Thanks for the PR to add instructions. My goal is to get all the dependencies of ProbabilityModels to a point where I am comfortable registering them and recommending their use. ProbabilityModels' Manifest.toml currently contains 16 unregistered dependencies – making it much worse to install than LoopVectorization (which is one of those dependencies).
I don’t think the branches are too problematic (and using ? instead of ifelse is probably fine / may get compiled to the same thing if it matters), but the time saved by not calculating sincos is a major win.
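If you want to check that claim yourself, you can compare the generated code for the two forms. A minimal sketch (the increment pattern mirrors the compression loop later in this post; for trivial bodies like these, LLVM typically lowers both to a select):
inc_ternary(c, j) = c != 0 ? j + 1 : j
inc_ifelse(c, j) = j + ifelse(c != 0, 1, 0)
# Compare the output of these two; they should look essentially the same:
# @code_llvm inc_ternary(1.0, 1)
# @code_llvm inc_ifelse(1.0, 1)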
So, let’s dive a little further into the vectorized approach.
The computer and Julia version I am using now is:
julia> versioninfo()
Julia Version 1.4.0-DEV.493
Commit 6ac632e324* (2019-11-17 18:40 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Importantly, the CPU has the avx512 instruction set. This means it sees more benefit from vectorization (vectors are twice as wide), and it is also better at handling vectorized code.
You broke vectorizing sincos into two parts:
1. Consolidating the non-zero values.
2. A vectorized loop to evaluate sincos.
First, I want to talk a little more about 2. For many functions, the C++ library xsimd is much better optimized than SLEEF. So instead of using the C library or musm’s Julia port, we’ll use xsimd.
Pardon the awkward code summing the return value; without it, Julia will crash. Here is the issue.
xsimd sincos is about 50% faster when benchmarked by itself:
julia> using BenchmarkTools, SLEEFPirates, xsimdwrap, SIMDPirates
julia> function sinccos_wrap(f::F, x) where {F}
s, c = f(x)
vsum(vadd(s, c))
end
sinccos_wrap (generic function with 1 method)
julia> x = ntuple(Val(8)) do i Core.VecElement(randn()) end
(VecElement{Float64}(-2.52217072669755), VecElement{Float64}(-0.09354991120444209), VecElement{Float64}(-0.07468916632960443), VecElement{Float64}(0.8452263812340517), VecElement{Float64}(0.017722087042160287), VecElement{Float64}(2.2032089188297372), VecElement{Float64}(-0.8076465796539984), VecElement{Float64}(1.2255421950288716))
julia> @benchmark sinccos_wrap(SLEEFPirates.sincos, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 23.417 ns (0.00% GC)
median time: 24.017 ns (0.00% GC)
mean time: 24.423 ns (0.00% GC)
maximum time: 57.857 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 996
julia> @benchmark sinccos_wrap(xsimdwrap.sincos, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 14.618 ns (0.00% GC)
median time: 15.080 ns (0.00% GC)
mean time: 15.478 ns (0.00% GC)
maximum time: 42.776 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 998
However, when we run a benchmark in a loop, SLEEFPirates is faster. I suspect that is because the function inlines, allowing many of the loads (i.e., loads of polynomial coefficients and other constants) to be hoisted out of the loop:
julia> using BenchmarkTools, LoopVectorization, xsimdwrap
julia> θ = randn(1000); c = randn(1000);
julia> function sumsc_sleefpirates(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
s, c = 0.0, 0.0
@vvectorize for i ∈ eachindex(θ)
sinθᵢ, cosθᵢ = sincos(θ[i])
s += coef[i] * sinθᵢ
c += coef[i] * cosθᵢ
end
s, c
end
sumsc_sleefpirates (generic function with 1 method)
julia> function sumsc_xsimd(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
s, c = 0.0, 0.0
@xvectorize for i ∈ eachindex(θ)
sinθᵢ, cosθᵢ = sincos(θ[i])
s += coef[i] * sinθᵢ
c += coef[i] * cosθᵢ
end
s, c
end
sumsc_xsimd (generic function with 1 method)
julia> function sumsc_serial(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
s, c = 0.0, 0.0
@inbounds for i ∈ eachindex(θ)
sinθᵢ, cosθᵢ = sincos(θ[i])
s += coef[i] * sinθᵢ
c += coef[i] * cosθᵢ
end
s, c
end
sumsc_serial (generic function with 1 method)
julia> @btime sumsc_serial($θ, $c)
10.817 μs (0 allocations: 0 bytes)
(6.166055099088835, -17.378963888315667)
julia> @btime sumsc_xsimd($θ, $c)
1.500 μs (0 allocations: 0 bytes)
(6.166055099088833, -17.378963888315653)
julia> @btime sumsc_sleefpirates($θ, $c)
1.262 μs (0 allocations: 0 bytes)
(6.166055099088833, -17.378963888315653)
Eventually, it would be nice to have a Julia port of xsimd. I may do that myself, but I’d rather build on someone else’s work (all the real work behind SLEEFPirates was done in SLEEF and musm’s Julia port!).
Depending on the length of the loop, it’s worth trying both xsimd and SLEEFPirates.
Note that xsimdwrap depends on LoopVectorization, but it just supplies a different dictionary for function substitutions so that xsimd functions get used instead.
Because it wraps a C++ library, xsimdwrap is also a little trickier to build than a pure Julia library. The build script expects you to have git and a C++ compiler installed (it checks for a CXX environment variable, defaulting to g++ if none is defined).
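For example, a hypothetical way to point the build at a different compiler (clang++ here is just an example):
ENV["CXX"] = "clang++" # the build script falls back to g++ when CXX is unset
using Pkg
Pkg.build("xsimdwrap")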
Now, let’s focus on 1.
using Random, BenchmarkTools
Random.seed!(1);
cedgu = randn(200);
cedgu[rand(1:200, 120)] .= 0;
sum(iszero, cedgu) # 91
buf = similar(cedgu);
cedgc = similar(cedgu);
θₜ = randn(length(cedgu));
function compress1!(buf, cedgc, cedgu, θₜ)
j = 1
@inbounds for i in eachindex(cedgu)
cedgui = cedgu[i]
cedgc[j] = cedgui
buf[j] = θₜ[i]
j += ifelse(cedgui != 0, 1, 0)
end
j - 1
end
function compress2!(buf, cedgc, cedgu, θₜ)
j = 1
@inbounds for i in eachindex(cedgu)
cedgui = cedgu[i]
if cedgui != 0
cedgc[j] = cedgui
buf[j] = θₜ[i]
j += 1
end
end
j - 1
end
compress1!(buf, cedgc, cedgu, θₜ) # 109 = 200 - 91
compress2!(buf, cedgc, cedgu, θₜ) # 109 = 200 - 91
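For reference, the two versions can be timed in isolation with something like the following (output omitted; as noted below, isolated timings are suspect because of the branch predictor):
@btime compress1!($buf, $cedgc, $cedgu, $θₜ)
@btime compress2!($buf, $cedgc, $cedgu, $θₜ)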
Timing these on my machine, I see version two as actually slightly faster. But without testing it in the actual function, that could just be because the branch predictor memorized the sequence of branches in compress2!.
Can we do better?
avx512 adds a bunch of new vector instructions, including a compressing store. So let’s try that.
Note that without avx512, LLVM 8 and earlier will crash on this code, while LLVM 9 will work and get the correct answer – but probably be much slower.
These instructions take a bitmask – an unsigned integer – and store only the elements whose corresponding bit is 1, packed adjacent to each other. So if you have UInt8(41)
julia> bitstring(UInt8(41))
"00101001"
and you had a vector <1.,2.,3.,4.,5.,6.,7.,8.>, it would store elements 1, 4, and 6 side by side into the destination vector a (read the bits from right to left):
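A minimal sketch of that compressing store, using the same SIMDPirates.compressstore! call that compress3! below relies on (illustrative only; the mask is passed as a plain unsigned integer, following the usage in that function):
using SIMDPirates
v = ntuple(i -> Core.VecElement(Float64(i)), Val(8)) # <1.0, 2.0, ..., 8.0>
a = zeros(8)
mask = UInt8(41) # 0b00101001: bits set for elements 1, 4, and 6
GC.@preserve a begin
SIMDPirates.compressstore!(pointer(a), v, mask) # selected lanes stored contiguously
end
a[1:count_ones(mask)] # expected: [1.0, 4.0, 6.0]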
In my benchmarks (not reproduced here), that approach is over 2x faster.
Again, that’s only with avx512. Otherwise, Julia will crash unless you happened to compile Julia with LLVM 9 from source, in which case it’ll work but probably won’t be faster.
Let’s time them.
Code for each of the versions:
using StatsBase, LinearAlgebra, LightGraphs, BenchmarkTools, Random
using LoopVectorization, VectorizationBase, SIMDPirates, xsimdwrap
mutable struct Arr
t::Int64
θ::Array{Float64,1}
C::Array{Float64,1}
G::SimpleDiGraph{Int64}
W::Array{Float64,1}
upagents::Array{Int64,1}
cedg::Array{Float64,1}
cn::Array{Bool,1}
end
function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
randn!(A.θ)
A.θ .= muladd.(σ, A.θ, μ)
rand!(A.C)
δC = Cmax - Cmin
sumAw = 0.0
@inbounds @simd ivdep for i ∈ eachindex(A.C)
Cᵢ = muladd(A.C[i], δC, Cmin)
A.C[i] = Cᵢ
Wᵢ = 1.0 - Cᵢ
A.W[i] = Wᵢ
sumAw += Wᵢ
end
A.W .*= inv(sumAw)
θₜ=Vector{Float64}[]
agents=range(1,length=N)
while(any(abs(x - A.θ[1])>0.01 for x in A.θ))
A.t+=1
push!(θₜ,copy(A.θ))
sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
@inbounds for i in A.upagents
A.cedg.=0.0
@inbounds for j in inneighbors(A.G, i)
if rand()<Pᵢ && i!=j
A.cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
end
end
A.cedg .*= ((1.0-A.C[i]) / sum(A.cedg))
A.cedg[i]=A.C[i]
s=0.0
c=0.0
@inbounds for l in 1:N
if(A.cedg[l]!=0)
s+=A.cedg[l]*sin(θₜ[A.t][l])
c+=A.cedg[l]*cos(θₜ[A.t][l])
end
end
A.θ[i]=atan(s,c)
end
end
return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
#return θₜ
end
function main_vec1(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
randn!(A.θ)
A.θ .= muladd.(σ, A.θ, μ)
rand!(A.C)
δC = Cmax - Cmin
sumAw = 0.0
@inbounds @simd ivdep for i ∈ eachindex(A.C)
Cᵢ = muladd(A.C[i], δC, Cmin)
A.C[i] = Cᵢ
Wᵢ = 1.0 - Cᵢ
A.W[i] = Wᵢ
sumAw += Wᵢ
end
A.W .*= inv(sumAw)
θₜ=Vector{Float64}[]
θbuf = similar(A.θ)
agents=range(1,length=N)
while any(abs(x - A.θ[1])>0.01 for x in A.θ)
A.t+=1
push!(θₜ,copy(A.θ))
sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
@inbounds for i in A.upagents
A.cedg.=0.0
@inbounds for j in inneighbors(A.G, i)
if rand()<Pᵢ && i!=j
A.cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
end
end
A.cedg .*= ((1.0-A.C[i]) / sum(A.cedg))
A.cedg[i]=A.C[i]
cedg = A.cedg
j = 1
@inbounds for i in eachindex(cedg)
cedg[j] = cedg[i]
θbuf[j] = θₜ[A.t][i]
j += ifelse(cedg[i] != 0, 1, 0)
end
s=0.0
c=0.0
@vvectorize for k in 1:j-1
si,ci = sincos(θbuf[k])
s += cedg[k]*si
c += cedg[k]*ci
end
A.θ[i]=atan(s,c)
end
end
return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
#return θₜ
end
function compress3!(
buf::AbstractVector{T},
cedgc::AbstractVector{T},
cedgu::AbstractVector{T},
θₜ::AbstractVector{T}
) where {T}
V = VectorizationBase.pick_vector(T)
W, Wshift = VectorizationBase.pick_vector_width_shift(T)
N = length(cedgu)
Nrep = N >>> Wshift
Nrem = N & (W - 1)
j = 0
size_T = sizeof(T)
vzero = SIMDPirates.vbroadcast(V, zero(T))
GC.@preserve buf cedgc cedgu θₜ begin
ptr_buf = pointer(buf)
ptr_cedgc = pointer(cedgc)
ptr_cedgu = pointer(cedgu)
ptr_θₜ = pointer(θₜ)
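# Main loop: load W elements at a time, build a non-zero mask, and compress-store the surviving lanes of cedgu and θₜ contiguously.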
for _ ∈ 1:Nrep
vcedgu = vload(V, ptr_cedgu)
mask = SIMDPirates.vnot_equal_mask(vcedgu, vzero)
vθₜ = vload(V, ptr_θₜ)
SIMDPirates.compressstore!(ptr_cedgc, vcedgu, mask)
SIMDPirates.compressstore!(ptr_buf, vθₜ, mask)
nones = count_ones(mask)
j += nones
ptr_cedgc += nones*size_T
ptr_buf += nones*size_T
ptr_cedgu += W*size_T
ptr_θₜ += W*size_T
end
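# Remainder: the masked load zeroes the lanes past Nrem, so those lanes drop out of the non-zero mask and are not stored.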
vcedgu = vload(V, ptr_cedgu, VectorizationBase.mask(T, Nrem))
mask = SIMDPirates.vnot_equal_mask(vcedgu, vzero)
vθₜ = vload(V, ptr_θₜ)
SIMDPirates.compressstore!(ptr_cedgc, vcedgu, mask)
SIMDPirates.compressstore!(ptr_buf, vθₜ, mask)
end
nones = count_ones(mask)
j += nones
j
end
function main_vec2(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
randn!(A.θ)
A.θ .= muladd.(σ, A.θ, μ)
rand!(A.C)
δC = Cmax - Cmin
sumAw = 0.0
@inbounds @simd ivdep for i ∈ eachindex(A.C)
Cᵢ = muladd(A.C[i], δC, Cmin)
A.C[i] = Cᵢ
Wᵢ = 1.0 - Cᵢ
A.W[i] = Wᵢ
sumAw += Wᵢ
end
A.W .*= inv(sumAw)
θₜ=Vector{Float64}[]
θbuf = similar(A.θ)
agents=range(1,length=N)
while any(abs(x - A.θ[1])>0.01 for x in A.θ)
A.t+=1
push!(θₜ,copy(A.θ))
sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
@inbounds for i in A.upagents
A.cedg.=0.0
@inbounds for j in inneighbors(A.G, i)
if rand()<Pᵢ && i!=j
A.cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
end
end
A.cedg .*= ((1.0-A.C[i]) / sum(A.cedg))
A.cedg[i]=A.C[i]
cedg = A.cedg
j = compress3!( θbuf, cedg, cedg, θₜ[A.t] )
s=0.0
c=0.0
@vvectorize for k in 1:j
si,ci = sincos(θbuf[k])
s += cedg[k]*si
c += cedg[k]*ci
end
A.θ[i]=atan(s,c)
end
end
return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
#return θₜ
end
function main_vec3(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
randn!(A.θ)
A.θ .= muladd.(σ, A.θ, μ)
rand!(A.C)
δC = Cmax - Cmin
sumAw = 0.0
@inbounds @simd ivdep for i ∈ eachindex(A.C)
Cᵢ = muladd(A.C[i], δC, Cmin)
A.C[i] = Cᵢ
Wᵢ = 1.0 - Cᵢ
A.W[i] = Wᵢ
sumAw += Wᵢ
end
A.W .*= inv(sumAw)
θₜ=Vector{Float64}[]
θbuf = similar(A.θ)
agents=range(1,length=N)
while any(abs(x - A.θ[1])>0.01 for x in A.θ)
A.t+=1
push!(θₜ,copy(A.θ))
sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
@inbounds for i in A.upagents
A.cedg.=0.0
@inbounds for j in inneighbors(A.G, i)
if rand()<Pᵢ && i!=j
A.cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
end
end
A.cedg .*= ((1.0-A.C[i]) / sum(A.cedg))
A.cedg[i]=A.C[i]
cedg = A.cedg
j = compress3!( θbuf, cedg, cedg, θₜ[A.t] )
s=0.0
c=0.0
@xvectorize for k in 1:j
si,ci = sincos(θbuf[k])
s += cedg[k]*si
c += cedg[k]*ci
end
A.θ[i]=atan(s,c)
end
end
return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
#return θₜ
end
Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=250;Pᵣ=0.0;Pᵢ=0.5;P=500;μ=pi/2;σ=pi/18;
@benchmark main($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
@benchmark main_vec1($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
@benchmark main_vec2($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
@benchmark main_vec3($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
Running the benchmarks:
julia> Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=250;Pᵣ=0.0;Pᵢ=0.5;P=500;μ=pi/2;σ=pi/18;
julia> @benchmark main($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
BenchmarkTools.Trial:
memory estimate: 6.85 MiB
allocs estimate: 14520
--------------
minimum time: 341.824 ms (0.00% GC)
median time: 381.903 ms (0.00% GC)
mean time: 377.373 ms (0.06% GC)
maximum time: 417.986 ms (0.38% GC)
--------------
samples: 14
evals/sample: 1
julia> @benchmark main_vec1($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
BenchmarkTools.Trial:
memory estimate: 6.77 MiB
allocs estimate: 14503
--------------
minimum time: 270.561 ms (0.00% GC)
median time: 313.175 ms (0.00% GC)
mean time: 327.182 ms (0.06% GC)
maximum time: 503.980 ms (0.00% GC)
--------------
samples: 16
evals/sample: 1
julia> @benchmark main_vec2($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
BenchmarkTools.Trial:
memory estimate: 6.71 MiB
allocs estimate: 14491
--------------
minimum time: 202.743 ms (0.00% GC)
median time: 238.547 ms (0.00% GC)
mean time: 248.084 ms (0.10% GC)
maximum time: 311.537 ms (0.00% GC)
--------------
samples: 21
evals/sample: 1
julia> @benchmark main_vec3($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
BenchmarkTools.Trial:
memory estimate: 6.83 MiB
allocs estimate: 14515
--------------
minimum time: 213.319 ms (0.00% GC)
median time: 240.131 ms (0.00% GC)
mean time: 247.002 ms (0.10% GC)
maximum time: 292.142 ms (0.00% GC)
--------------
samples: 21
evals/sample: 1
Comparing minimum times (which is what @btime reports), the run time went from 342 ms without vectorization, to 271 ms with SLEEFPirates-vectorized sincos, to 203 ms with SLEEFPirates-vectorized sincos plus vectorized compression, and 213 ms with xsimd-vectorized sincos plus vectorized compression.
The caveat on the vectorized compression, again, is that it requires avx512.
From there, I’d recommend profiling to see where the code is spending most of its time before trying anything else.
Wow! The compression stuff was unknown to me. Any hope Julia will expose such a compress function that does the right thing on various CPUs?
Is this a general pattern for avoiding branches, or is it a custom solution that fits this particular problem? It feels like it could be generally applicable whenever there is a somewhat expensive loop body that fails to vectorize due to a branch that omits some computations. It seems the savings from vectorizing in this example were on the order of 10 μs, whereas the cost of the compression (without the avx512 stuff) was on the order of 100 ns. The maximum overall saving of a factor of ~1.7 seems quite nice for this example.
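Concretely, the pattern I have in mind would look something like this (just a restatement of compress1! plus the @vvectorize loop from above, with illustrative names; assumes `using LoopVectorization` for @vvectorize):
function filtered_sumsc(θ, w, θbuf, wbuf)
# compression pass: keep only the entries with non-zero weight
j = 1
@inbounds for i in eachindex(w)
wᵢ = w[i]
wbuf[j] = wᵢ
θbuf[j] = θ[i]
j += ifelse(wᵢ != 0, 1, 0)
end
# dense pass: now a plain loop the vectorizer can handle
s, c = 0.0, 0.0
@vvectorize for k in 1:j-1
sᵢ, cᵢ = sincos(θbuf[k])
s += wbuf[k] * sᵢ
c += wbuf[k] * cᵢ
end
s, c
end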
I happen to know from a Slack discussion that the OP is trying to implement parts of the algorithm on a GPU. My intuition was to disregard the branch and simply do the calculation for the whole vector, wasting some calculations whose results get multiplied by 0 (I have not tried it out).
I recommend doing algorithmic optimization before vectorization.
function main2(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
Random.seed!(1)
A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
randn!(A.θ)
A.θ .= muladd.(σ, A.θ, μ)
sinθ = sin.(A.θ)
cosθ = cos.(A.θ)
sinθ_next = copy(sinθ)
cosθ_next = copy(cosθ)
rand!(A.C)
δC = Cmax - Cmin
sumAw = 0.0
@inbounds @simd ivdep for i ∈ eachindex(A.C)
Cᵢ = muladd(A.C[i], δC, Cmin)
A.C[i] = Cᵢ
Wᵢ = 1.0 - Cᵢ
A.W[i] = Wᵢ
sumAw += Wᵢ
end
A.W .*= inv(sumAw)
θₜ=Vector{Float64}[]
agents=range(1,length=N)
while(any(abs(x - A.θ[1])>0.01 for x in A.θ))
A.t+=1
push!(θₜ,copy(A.θ))
sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
@inbounds for i in A.upagents
s=0.0
c=0.0
cedg_sum = 0.0
for j in inneighbors(A.G, i)
if rand()<Pᵢ && i!=j
cedg_j = exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
cedg_sum += cedg_j
s += cedg_j * sinθ[j]
c += cedg_j * cosθ[j]
end
end
cedg_i = A.C[i] * cedg_sum / (1.0-A.C[i])
s += cedg_i * sinθ[i]
c += cedg_i * cosθ[i]
A.θ[i]=atan(s,c)
α = 1 / sqrt(s * s + c * c)
sinθ_next[i] = s * α
cosθ_next[i] = c * α
end
copy!(sinθ, sinθ_next)
copy!(cosθ, cosθ_next)
end
return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
end
(Notice that I added a fixed seed at the start of the function.)
On my computer the timing compared to your baseline main, with the same seed added, is
It should be added that, as noted earlier in the thread (around message 48), the loop we’re using here isn’t quite the right one, so the timings might turn out differently once that is fixed.
And yes, I do appreciate seeing new vectorization tricks.
Yes, I have been trying to port parts of this code onto the GPU. However, coming up with alternative logic for some of the conditional statements inside the for loop is turning out to be difficult.
I tried running part of the code on the GPU, but it takes longer:
using Random, Distributions, StatsBase, LightGraphs, CuArrays # packages this code relies on
mutable struct Arr
t::Int64
θ::Array{Float64,1}
C::Array{Float64,1}
G::SimpleDiGraph{Int64}
W::Array{Float64,1}
upagents::Array{Int64,1}
cedg::Array{Float64,1}
cn::Array{Bool,1}
end
function update!(s::CuArray,c::CuArray,cedg::CuArray,θ::CuArray)
s .= cedg.*sin.(θ)
c .= cedg.*cos.(θ)
end
function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
rand!(Normal(μ,σ),A.θ)
rand!(Uniform(Cmin,Cmax),A.C)
A.W.=1.0.-A.C
A.W./=sum(A.W)
θₜ=Vector{Float64}[]
agents=range(1,length=N)
s=CuArrays.fill(0.0f0,N)
c=CuArrays.fill(0.0f0,N)
while(any(abs(x - A.θ[1])>0.01 for x in A.θ))
A.t+=1
push!(θₜ,copy(A.θ))
sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
@inbounds for h in A.upagents
A.cedg.=0.0
A.cn.=false
for j in inneighbors(A.G,h)
A.cn[j]=true
end
@inbounds for i in 1:N
if((A.cn[i]==true)||(A.cn[i]==false && rand()<Pᵢ))
A.cedg[i]=exp(-abs(θₜ[A.t][h]-θₜ[A.t][i])/(1.0-A.C[h]))
end
end
A.cedg[h]=0.0
A.cedg.*=((1.0-A.C[h])/sum(A.cedg))
A.cedg[h]=A.C[h]
s.=0.0f0
c.=0.0f0
update!(s,c,cu(A.cedg),cu(θₜ[A.t]))
A.θ[h]=atan(sum(Array(s)),sum(Array(c)))
end
end
return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
end
For the following values:
using BenchmarkTools
Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=Int(0.25*N);Pᵣ=0.0;Pᵢ=0.5;P=Int(0.5*N);μ=pi/2;σ=pi/18;
@btime main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ)
Yields,
10.538 s (22077831 allocations: 2.92 GiB)
(187, 1.5737587213516235, 1.5750543941387851)
It takes excruciatingly long to run this code for N=10000
Sending data back and forth between the CPU and GPU is costly. Why not perform the sum on the GPU?
Also, if you’re going to run stuff on the GPU, I would guess you need to have the entire algorithm on the GPU to avoid excessive data transfer.
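For instance, a hypothetical variant of update! that keeps the reduction on the device, so only two scalars come back to the host (an untested sketch; it relies on CuArrays broadcasting and sum):
function update_sum(cedg_d::CuArray, θ_d::CuArray)
s = sum(cedg_d .* sin.(θ_d)) # fused broadcast + reduction, all on the GPU
c = sum(cedg_d .* cos.(θ_d))
s, c
end
# then in the loop: A.θ[h] = atan(update_sum(cu(A.cedg), cu(θₜ[A.t]))...)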
I tried having the inner for-loop (for h in upagents) on the GPU, but I couldn’t find alternatives to a few of the operations in there. I posted about it here: Rewriting function on CPU for execution on GPU