How can I make this piece of code faster? By optimising it, or by using multiple threads?

Ah, the package is not registered; you must provide the full URL:

using Pkg
Pkg.add(PackageSpec(url="https://github.com/chriselrod/VectorizationBase.jl"))
Pkg.add(PackageSpec(url="https://github.com/chriselrod/SIMDPirates.jl"))
Pkg.add(PackageSpec(url="https://github.com/chriselrod/SLEEFPirates.jl"))
Pkg.add(PackageSpec(url="https://github.com/chriselrod/LoopVectorization.jl"))

Edit: sorry, there were some weird invisible characters before some of the Pkg calls; it should be updated now.

1 Like

Thanks for the PR to add instructions. My goal is to get all the dependencies of ProbabilityModels to a point where I am comfortable registering them and recommending their use. ProbabilityModels’ Manifest.toml currently contains 16 unregistered dependencies – making it much worse to install than LoopVectorization (which is one of those dependencies).

I don’t think the branches are too problematic (using ? instead of ifelse is probably fine, and may get compiled to the same thing if it matters), but the time saved by not calculating sincos is a major win.
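
For the curious, here is a minimal toy sketch contrasting the two forms (the function names are just for illustration): ifelse always evaluates both arguments and lowers to a branchless select, while the ternary may compile to a branch, though LLVM often produces the same select when both arms are cheap.

# Toy sketch: compare the generated code for the two forms.
step_ternary(x, j) = x != 0 ? j + 1 : j
step_ifelse(x, j)  = j + ifelse(x != 0, 1, 0)

# @code_llvm step_ternary(1.0, 3)
# @code_llvm step_ifelse(1.0, 3)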

So, let’s dive a little further into the vectorized approach.
The computer and Julia version I am using now is:

julia> versioninfo()
Julia Version 1.4.0-DEV.493
Commit 6ac632e324* (2019-11-17 18:40 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Importantly, the CPU has the avx512 instruction set. This means both that it will see more benefit from vectorization (vectors are twice as wide) and that it is better at vectorizing code.
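
If you want to check the natural vector width on your own machine, VectorizationBase (installed above) should be able to report it; a quick sketch, assuming pick_vector_width is available in your version:

using VectorizationBase

# Number of Float64 lanes in a natural-width SIMD register on this CPU;
# expect 8 with avx512 and 4 with plain AVX2.
VectorizationBase.pick_vector_width(Float64)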

You broke vectorizing sincos into two parts:

  1. Consolidating the non-zero values
  2. Evaluating sincos in a vectorized loop.

First, I want to talk a little more about step 2. For many functions, the C++ library xsimd is much better optimized than SLEEF. So instead of using the SLEEF C library or musm’s Julia port of it, we’ll use xsimd.

Pardon the awkward code summing the return value. Without it, Julia will crash with:

julia: /home/chriselrod/Documents/languages/julia/src/cgutils.cpp:514: unsigned int convert_struct_offset(llvm::Type*, unsigned int): Assertion `SL->getElementOffset(idx) == byte_offset' failed.

Here is the issue.
xsimd sincos is about 50% faster when benchmarked by itself:

julia> using BenchmarkTools, SLEEFPirates, xsimdwrap, SIMDPirates

julia> function sinccos_wrap(f::F, x) where {F}
           s, c = f(x)
           vsum(vadd(s, c))
       end
sinccos_wrap (generic function with 1 method)

julia> x = ntuple(Val(8)) do i Core.VecElement(randn()) end
(VecElement{Float64}(-2.52217072669755), VecElement{Float64}(-0.09354991120444209), VecElement{Float64}(-0.07468916632960443), VecElement{Float64}(0.8452263812340517), VecElement{Float64}(0.017722087042160287), VecElement{Float64}(2.2032089188297372), VecElement{Float64}(-0.8076465796539984), VecElement{Float64}(1.2255421950288716))

julia> @benchmark sinccos_wrap(SLEEFPirates.sincos, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     23.417 ns (0.00% GC)
  median time:      24.017 ns (0.00% GC)
  mean time:        24.423 ns (0.00% GC)
  maximum time:     57.857 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark sinccos_wrap(xsimdwrap.sincos, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.618 ns (0.00% GC)
  median time:      15.080 ns (0.00% GC)
  mean time:        15.478 ns (0.00% GC)
  maximum time:     42.776 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

However, when we run the benchmark in a loop, SLEEFPirates is faster. I suspect that is because the function inlines, allowing many of the loads (i.e., loads of polynomial coefficients and other constants) to be hoisted out of the loop:

julia> using BenchmarkTools, LoopVectorization, xsimdwrap

julia> θ = randn(1000); c = randn(1000);

julia> function sumsc_sleefpirates(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
           s, c = 0.0, 0.0
           @vvectorize for i ∈ eachindex(θ)
               sinθᵢ, cosθᵢ = sincos(θ[i])
               s += coef[i] * sinθᵢ
               c += coef[i] * cosθᵢ
           end
           s, c
       end
sumsc_sleefpirates (generic function with 1 method)

julia> function sumsc_xsimd(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
           s, c = 0.0, 0.0
           @xvectorize for i ∈ eachindex(θ)
               sinθᵢ, cosθᵢ = sincos(θ[i])
               s += coef[i] * sinθᵢ
               c += coef[i] * cosθᵢ
           end
           s, c
       end
sumsc_xsimd (generic function with 1 method)

julia> function sumsc_serial(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
           s, c = 0.0, 0.0
           @inbounds for i ∈ eachindex(θ)
               sinθᵢ, cosθᵢ = sincos(θ[i])
               s += coef[i] * sinθᵢ
               c += coef[i] * cosθᵢ
           end
           s, c
       end
sumsc_serial (generic function with 1 method)

julia> @btime sumsc_serial($θ, $c)
  10.817 μs (0 allocations: 0 bytes)
(6.166055099088835, -17.378963888315667)

julia> @btime sumsc_xsimd($θ, $c)
  1.500 μs (0 allocations: 0 bytes)
(6.166055099088833, -17.378963888315653)

julia> @btime sumsc_sleefpirates($θ, $c)
  1.262 μs (0 allocations: 0 bytes)
(6.166055099088833, -17.378963888315653)

Eventually, it would be nice to have a Julia port of xsimd. I may do that at some point, but I’d rather build on someone else’s work (all the real work behind SLEEFPirates was done in SLEEF and in musm’s Julia port!).

Depending on the length of the loop, it’s worth trying both xsimd and SLEEFPirates.

Note that xsimdwrap depends on LoopVectorization, but it just supplies a different dictionary for function substitutions so that xsimd functions get used instead.
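
Roughly, the idea is a lookup table mapping scalar functions to SIMD implementations; a simplified, hypothetical sketch of that (not the actual internals of either package):

# Hypothetical sketch only: the vectorizing macro rewrites calls it recognizes
# via a table, so supplying a different table swaps the SIMD backend.
const SLEEF_TABLE = Dict{Symbol,Expr}(
    :sincos => :(SLEEFPirates.sincos),
    :exp    => :(SLEEFPirates.exp),
)
const XSIMD_TABLE = Dict{Symbol,Expr}(
    :sincos => :(xsimdwrap.sincos),
    :exp    => :(xsimdwrap.exp),
)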

Because it wraps a C++ library, building xsimdwrap is also a little trickier than installing a pure-Julia package. The build script expects you to have git and a C++ compiler installed (it checks for a CXX environment variable, defaulting to g++ if none is defined).
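
Assuming the build script picks up CXX from the current session’s environment, you can point it at a specific compiler before building; a sketch:

# Sketch: choose the C++ compiler before building (g++ is the stated default).
ENV["CXX"] = "clang++"   # or the path to whatever C++ compiler you have
using Pkg
Pkg.build("xsimdwrap")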

Now, let’s focus on step 1.

using Random, BenchmarkTools
Random.seed!(1);
cedgu = randn(200);
cedgu[rand(1:200, 120)] .= 0;
sum(iszero, cedgu) # 91
buf = similar(cedgu);
cedgc = similar(cedgu);
θₜ = randn(length(cedgu));

function compress1!(buf, cedgc, cedgu, θₜ)
    j = 1
    @inbounds for i in eachindex(cedgu)
        cedgui = cedgu[i]
        cedgc[j] = cedgui
        buf[j] = θₜ[i]
        j += ifelse(cedgui != 0, 1, 0)
    end
    j - 1
end
function compress2!(buf, cedgc, cedgu, θₜ)
    j = 1
    @inbounds for i in eachindex(cedgu)
        cedgui = cedgu[i]
        if cedgui != 0
            cedgc[j] = cedgui
            buf[j] = θₜ[i]
            j += 1
        end
    end
    j - 1
end

compress1!(buf, cedgc, cedgu, θₜ) # 109 = 200 - 91
compress2!(buf, cedgc, cedgu, θₜ) # 109 = 200 - 91

Now, benchmarking

julia> @benchmark compress1!($buf, $cedgc, $cedgu, $θₜ)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     131.115 ns (0.00% GC)
  median time:      140.322 ns (0.00% GC)
  mean time:        145.980 ns (0.00% GC)
  maximum time:     197.009 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     858

julia> @benchmark compress2!($buf, $cedgc, $cedgu, $θₜ)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     115.157 ns (0.00% GC)
  median time:      115.205 ns (0.00% GC)
  mean time:        119.126 ns (0.00% GC)
  maximum time:     255.910 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     918

So version two is actually slightly faster here. But without testing it in the actual function, that could just be because the branch predictor memorized the sequence of branches in compress2!.

Can we do better?

avx512 adds a bunch of new vector instructions, including a compressing store. So let’s try that.
Note that without avx512, LLVM 8 and earlier will crash on this code, while LLVM 9 will work and get the correct answer – but probably be much slower.

These instructions take a bitmask – an unsigned integer – and store only the lanes corresponding to set bits, packed next to each other. So if you have UInt8(41)

julia> bitstring(UInt8(41))
"00101001"

and a vector <1.,2.,3.,4.,5.,6.,7.,8.>, it would store elements 1, 4, and 6 side by side into the vector a (read the bits from right to left):

julia> a = collect(100.0:-1:1); a'
1×100 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 100.0  99.0  98.0  97.0  96.0  95.0  94.0  93.0  92.0  91.0  90.0  89.0  88.0  87.0  86.0  85.0  84.0  83.0  82.0  81.0  80.0  79.0  78.0  77.0  76.0  75.0  74.0  73.0  72.0  71.0  70.0  69.0  68.0  67.0  66.0  …  36.0  35.0  34.0  33.0  32.0  31.0  30.0  29.0  28.0  27.0  26.0  25.0  24.0  23.0  22.0  21.0  20.0  19.0  18.0  17.0  16.0  15.0  14.0  13.0  12.0  11.0  10.0  9.0  8.0  7.0  6.0  5.0  4.0  3.0  2.0  1.0

julia> SIMDPirates.compressstore!(pointer(a), ntuple(i -> Core.VecElement{Float64}(i), Val(8)), UInt8(41))

julia> a'
1×100 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 1.0  4.0  6.0  97.0  96.0  95.0  94.0  93.0  92.0  91.0  90.0  89.0  88.0  87.0  86.0  85.0  84.0  83.0  82.0  81.0  80.0  79.0  78.0  77.0  76.0  75.0  74.0  73.0  72.0  71.0  70.0  69.0  68.0  67.0  66.0  65.0  …  36.0  35.0  34.0  33.0  32.0  31.0  30.0  29.0  28.0  27.0  26.0  25.0  24.0  23.0  22.0  21.0  20.0  19.0  18.0  17.0  16.0  15.0  14.0  13.0  12.0  11.0  10.0  9.0  8.0  7.0  6.0  5.0  4.0  3.0  2.0  1.0

So we can use this for a vectorized compression.

using VectorizationBase, SIMDPirates

function compress3!(
    buf::AbstractVector{T},
    cedgc::AbstractVector{T},
    cedgu::AbstractVector{T},
    θₜ::AbstractVector{T}
) where {T}
    V = VectorizationBase.pick_vector(T)
    W, Wshift = VectorizationBase.pick_vector_width_shift(T)
    N = length(cedgu)
    Nrep = N >>> Wshift    # number of full-width vector iterations
    Nrem = N & (W - 1)     # leftover elements handled by the masked tail
    j = 0                  # running count of retained (non-zero) elements
    size_T = sizeof(T)
    vzero = SIMDPirates.vbroadcast(V, zero(T))
    GC.@preserve buf cedgc cedgu θₜ begin
        ptr_buf = pointer(buf)
        ptr_cedgc = pointer(cedgc)
        ptr_cedgu = pointer(cedgu)
        ptr_θₜ = pointer(θₜ)
        for _ ∈ 1:Nrep
            vcedgu = vload(V, ptr_cedgu)                        # load W coefficients
            mask = SIMDPirates.vnot_equal_mask(vcedgu, vzero)   # bitmask of non-zero lanes
            vθₜ = vload(V, ptr_θₜ)
            SIMDPirates.compressstore!(ptr_cedgc, vcedgu, mask) # pack kept lanes contiguously
            SIMDPirates.compressstore!(ptr_buf, vθₜ, mask)

            nones = count_ones(mask)      # number of lanes kept this iteration
            j += nones
            ptr_cedgc += nones*size_T     # advance outputs by the number kept
            ptr_buf += nones*size_T
            ptr_cedgu += W*size_T         # advance inputs by a full vector width
            ptr_θₜ += W*size_T
        end
        vcedgu = vload(V, ptr_cedgu, VectorizationBase.mask(T, Nrem))  # masked load for the tail
        mask = SIMDPirates.vnot_equal_mask(vcedgu, vzero)
        vθₜ = vload(V, ptr_θₜ)
        SIMDPirates.compressstore!(ptr_cedgc, vcedgu, mask)
        SIMDPirates.compressstore!(ptr_buf, vθₜ, mask)
    end
    nones = count_ones(mask)
    j += nones

    j
end

Let’s test and time it.

julia> compress3!(buf, cedgc, cedgu, θₜ)
109

julia> @benchmark compress3!($buf, $cedgc, $cedgu, $θₜ)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     44.515 ns (0.00% GC)
  median time:      45.807 ns (0.00% GC)
  mean time:        46.772 ns (0.00% GC)
  maximum time:     83.039 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     989

So that is over 2x faster.
Again, that’s only with avx512. Otherwise, Julia will crash, unless you happened to compile Julia with LLVM 9 from source, in which case it’ll work but probably won’t be faster.

Now, let’s time the complete functions.
Code for each of the versions:

using StatsBase, LinearAlgebra, LightGraphs, BenchmarkTools, Random
using LoopVectorization, VectorizationBase, SIMDPirates, xsimdwrap

mutable struct Arr
    t::Int64
    θ::Array{Float64,1}
    C::Array{Float64,1}
    G::SimpleDiGraph{Int64}
    W::Array{Float64,1}
    upagents::Array{Int64,1}
    cedg::Array{Float64,1}
    cn::Array{Bool,1}
end

function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
          Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
    randn!(A.θ)
    A.θ .= muladd.(σ, A.θ, μ)
    rand!(A.C)
    δC = Cmax - Cmin
    sumAw = 0.0
    @inbounds @simd ivdep for i ∈ eachindex(A.C)
        Cᵢ = muladd(A.C[i], δC, Cmin)
        A.C[i] = Cᵢ
        Wᵢ = 1.0 - Cᵢ
        A.W[i] = Wᵢ
        sumAw += Wᵢ
    end
    A.W .*= inv(sumAw)
    θₜ=Vector{Float64}[]
    agents=range(1,length=N)
    while(any(abs(x - A.θ[1])>0.01 for x in A.θ))
        A.t+=1
        push!(θₜ,copy(A.θ))
        sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
        @inbounds for i in A.upagents
            A.cedg.=0.0
            @inbounds for j in inneighbors(A.G, i)
                if rand()<Pᵢ && i!=j
                    A.cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
                end
            end
            A.cedg .*= ((1.0-A.C[i]) / sum(A.cedg))
            A.cedg[i]=A.C[i]
            s=0.0
            c=0.0
            @inbounds for l in 1:N
                if(A.cedg[l]!=0)
                    s+=A.cedg[l]*sin(θₜ[A.t][l])
                    c+=A.cedg[l]*cos(θₜ[A.t][l])
                end
            end
            A.θ[i]=atan(s,c)
        end
    end
    return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
    #return θₜ
end


function main_vec1(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
          Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
    randn!(A.θ)
    A.θ .= muladd.(σ, A.θ, μ)
    rand!(A.C)
    δC = Cmax - Cmin
    sumAw = 0.0
    @inbounds @simd ivdep for i ∈ eachindex(A.C)
        Cᵢ = muladd(A.C[i], δC, Cmin)
        A.C[i] = Cᵢ
        Wᵢ = 1.0 - Cᵢ
        A.W[i] = Wᵢ
        sumAw += Wᵢ
    end
    A.W .*= inv(sumAw)
    θₜ=Vector{Float64}[]
    θbuf = similar(A.θ)
    agents=range(1,length=N)
    while any(abs(x - A.θ[1])>0.01 for x in A.θ)
        A.t+=1
        push!(θₜ,copy(A.θ))
        sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
        @inbounds for i in A.upagents
            A.cedg.=0.0
            @inbounds for j in inneighbors(A.G, i)
                if rand()<Pᵢ && i!=j
                    A.cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
                end
            end
            A.cedg .*= ((1.0-A.C[i]) / sum(A.cedg))
            A.cedg[i]=A.C[i]
            cedg = A.cedg
            j = 1
            @inbounds for i in eachindex(cedg)
                cedg[j] = cedg[i]
                θbuf[j] = θₜ[A.t][i]
                j += ifelse(cedg[i] != 0, 1, 0)
            end
            s=0.0
            c=0.0
            @vvectorize for k in 1:j-1
                si,ci = sincos(θbuf[k])
                s += cedg[k]*si
                c += cedg[k]*ci
            end
            A.θ[i]=atan(s,c)
        end
    end
    return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
    #return θₜ
end


function compress3!(
    buf::AbstractVector{T},
    cedgc::AbstractVector{T},
    cedgu::AbstractVector{T},
    θₜ::AbstractVector{T}
) where {T}
    V = VectorizationBase.pick_vector(T)
    W, Wshift = VectorizationBase.pick_vector_width_shift(T)
    N = length(cedgu)
    Nrep = N >>> Wshift
    Nrem = N & (W - 1)
    j = 0
    size_T = sizeof(T)
    vzero = SIMDPirates.vbroadcast(V, zero(T))
    GC.@preserve buf cedgc cedgu θₜ begin
        ptr_buf = pointer(buf)
        ptr_cedgc = pointer(cedgc)
        ptr_cedgu = pointer(cedgu)
        ptr_θₜ = pointer(θₜ)
        for _ ∈ 1:Nrep
            vcedgu = vload(V, ptr_cedgu)
            mask = SIMDPirates.vnot_equal_mask(vcedgu, vzero)
            vθₜ = vload(V, ptr_θₜ)
            SIMDPirates.compressstore!(ptr_cedgc, vcedgu, mask)
            SIMDPirates.compressstore!(ptr_buf, vθₜ, mask)

            nones = count_ones(mask)
            j += nones
            ptr_cedgc += nones*size_T
            ptr_buf += nones*size_T
            ptr_cedgu += W*size_T
            ptr_θₜ += W*size_T
        end
        vcedgu = vload(V, ptr_cedgu, VectorizationBase.mask(T, Nrem))
        mask = SIMDPirates.vnot_equal_mask(vcedgu, vzero)
        vθₜ = vload(V, ptr_θₜ)
        SIMDPirates.compressstore!(ptr_cedgc, vcedgu, mask)
        SIMDPirates.compressstore!(ptr_buf, vθₜ, mask)
    end
    nones = count_ones(mask)
    j += nones

    j
end

function main_vec2(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
          Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
    randn!(A.θ)
    A.θ .= muladd.(σ, A.θ, μ)
    rand!(A.C)
    δC = Cmax - Cmin
    sumAw = 0.0
    @inbounds @simd ivdep for i ∈ eachindex(A.C)
        Cᵢ = muladd(A.C[i], δC, Cmin)
        A.C[i] = Cᵢ
        Wᵢ = 1.0 - Cᵢ
        A.W[i] = Wᵢ
        sumAw += Wᵢ
    end
    A.W .*= inv(sumAw)
    θₜ=Vector{Float64}[]
    θbuf = similar(A.θ)
    agents=range(1,length=N)
    while any(abs(x - A.θ[1])>0.01 for x in A.θ)
        A.t+=1
        push!(θₜ,copy(A.θ))
        sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
        @inbounds for i in A.upagents
            A.cedg.=0.0
            @inbounds for j in inneighbors(A.G, i)
                if rand()<Pᵢ && i!=j
                    A.cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
                end
            end
            A.cedg .*= ((1.0-A.C[i]) / sum(A.cedg))
            A.cedg[i]=A.C[i]
            cedg = A.cedg
            j = compress3!( θbuf, cedg, cedg, θₜ[A.t] )
            s=0.0
            c=0.0
            @vvectorize for k in 1:j
                si,ci = sincos(θbuf[k])
                s += cedg[k]*si
                c += cedg[k]*ci
            end
            A.θ[i]=atan(s,c)
        end
    end
    return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
    #return θₜ
end


function main_vec3(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
          Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
    randn!(A.θ)
    A.θ .= muladd.(σ, A.θ, μ)
    rand!(A.C)
    δC = Cmax - Cmin
    sumAw = 0.0
    @inbounds @simd ivdep for i ∈ eachindex(A.C)
        Cᵢ = muladd(A.C[i], δC, Cmin)
        A.C[i] = Cᵢ
        Wᵢ = 1.0 - Cᵢ
        A.W[i] = Wᵢ
        sumAw += Wᵢ
    end
    A.W .*= inv(sumAw)
    θₜ=Vector{Float64}[]
    θbuf = similar(A.θ)
    agents=range(1,length=N)
    while any(abs(x - A.θ[1])>0.01 for x in A.θ)
        A.t+=1
        push!(θₜ,copy(A.θ))
        sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
        @inbounds for i in A.upagents
            A.cedg.=0.0
            @inbounds for j in inneighbors(A.G, i)
                if rand()<Pᵢ && i!=j
                    A.cedg[j]=exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
                end
            end
            A.cedg .*= ((1.0-A.C[i]) / sum(A.cedg))
            A.cedg[i]=A.C[i]
            cedg = A.cedg
            j = compress3!( θbuf, cedg, cedg, θₜ[A.t] )
            s=0.0
            c=0.0
            @xvectorize for k in 1:j
                si,ci = sincos(θbuf[k])
                s += cedg[k]*si
                c += cedg[k]*ci
            end
            A.θ[i]=atan(s,c)
        end
    end
    return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
    #return θₜ
end

Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=250;Pᵣ=0.0;Pᵢ=0.5;P=500;μ=pi/2;σ=pi/18;
@benchmark main($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
@benchmark main_vec1($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
@benchmark main_vec2($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
@benchmark main_vec3($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)

Running the benchmarks:

julia> Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=250;Pᵣ=0.0;Pᵢ=0.5;P=500;μ=pi/2;σ=pi/18;

julia> @benchmark main($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
BenchmarkTools.Trial: 
  memory estimate:  6.85 MiB
  allocs estimate:  14520
  --------------
  minimum time:     341.824 ms (0.00% GC)
  median time:      381.903 ms (0.00% GC)
  mean time:        377.373 ms (0.06% GC)
  maximum time:     417.986 ms (0.38% GC)
  --------------
  samples:          14
  evals/sample:     1

julia> @benchmark main_vec1($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
BenchmarkTools.Trial: 
  memory estimate:  6.77 MiB
  allocs estimate:  14503
  --------------
  minimum time:     270.561 ms (0.00% GC)
  median time:      313.175 ms (0.00% GC)
  mean time:        327.182 ms (0.06% GC)
  maximum time:     503.980 ms (0.00% GC)
  --------------
  samples:          16
  evals/sample:     1

julia> @benchmark main_vec2($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
BenchmarkTools.Trial: 
  memory estimate:  6.71 MiB
  allocs estimate:  14491
  --------------
  minimum time:     202.743 ms (0.00% GC)
  median time:      238.547 ms (0.00% GC)
  mean time:        248.084 ms (0.10% GC)
  maximum time:     311.537 ms (0.00% GC)
  --------------
  samples:          21
  evals/sample:     1

julia> @benchmark main_vec3($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
BenchmarkTools.Trial: 
  memory estimate:  6.83 MiB
  allocs estimate:  14515
  --------------
  minimum time:     213.319 ms (0.00% GC)
  median time:      240.131 ms (0.00% GC)
  mean time:        247.002 ms (0.10% GC)
  maximum time:     292.142 ms (0.00% GC)
  --------------
  samples:          21
  evals/sample:     1

Comparing minimum times, the full function went from
342 ms without vectorization, to
271 ms with SLEEFPirates-vectorized sincos, to
203 ms with SLEEFPirates-vectorized sincos and vectorized compression, and
213 ms with xsimd-vectorized sincos and vectorized compression.

The caveat on the vectorized compression, again, is that it requires avx512.

From there, I’d recommend profiling to see where the code is spending most of its time before trying anything else.
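
A minimal way to do that with the built-in Profile standard library, reusing the arguments from the benchmarks above:

using Profile

main(Cmin, Cmax, N, n, outdeg, Pᵣ, Pᵢ, P, μ, σ)          # run once to compile
Profile.clear()
@profile main(Cmin, Cmax, N, n, outdeg, Pᵣ, Pᵢ, P, μ, σ)
Profile.print(format = :flat, sortedby = :count)          # flat view sorted by sample count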

4 Likes

Wow :slight_smile: The compression stuff was unknown to me. Any hope Julia will expose such a compress function that does the right thing on various CPUs?

Is this a general pattern for avoiding branches, or is it more of a custom solution that fits this particular problem? It feels like it could be generally applicable whenever there is a somewhat expensive loop body that fails to vectorize due to a branch that omits some computations. It seems the savings from vectorizing in this example were on the order of 10 μs, whereas the cost of the compression (without the avx512 stuff) was on the order of 100 ns. The maximum overall saving of a factor of ~1.7 seems quite nice for this example.

I happen to know from a Slack discussion that the OP is trying to implement parts of the algorithm on a GPU. My intuition was to disregard the branch and simply do the calculation for the whole vector, wasting some calculations that are multiplied by 0 (I have not tried it out).
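
For reference, a rough, untested sketch of that branch-free idea (the name sumsc_nobranch is just for illustration; whether the scalar sincos actually vectorizes under @simd is a separate question, which is what @vvectorize addressed above):

# Untested sketch: evaluate sincos for every element and rely on
# cedg[i] == 0 to cancel the unwanted terms, so the loop body has no branch.
function sumsc_nobranch(θ, cedg)
    s, c = 0.0, 0.0
    @inbounds @simd for i in eachindex(θ, cedg)
        sinθᵢ, cosθᵢ = sincos(θ[i])
        s += cedg[i] * sinθᵢ
        c += cedg[i] * cosθᵢ
    end
    s, c
end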

1 Like

What about the accuracy?

I recommend doing algorithmic optimization before vectorization. :wink: The key change below is to cache sin and cos of every angle: since θ[i] = atan(s, c), the updated sine and cosine are just s and c scaled by 1/√(s² + c²), so the inner loop never calls sin or cos again, and the weighted sums are accumulated directly while iterating over the neighbors.

function main2(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    Random.seed!(1)
    A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
          Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
    randn!(A.θ)
    A.θ .= muladd.(σ, A.θ, μ)
    sinθ = sin.(A.θ)
    cosθ = cos.(A.θ)
    sinθ_next = copy(sinθ)
    cosθ_next = copy(cosθ)
    rand!(A.C)
    δC = Cmax - Cmin
    sumAw = 0.0
    @inbounds @simd ivdep for i ∈ eachindex(A.C)
        Cᵢ = muladd(A.C[i], δC, Cmin)
        A.C[i] = Cᵢ
        Wᵢ = 1.0 - Cᵢ
        A.W[i] = Wᵢ
        sumAw += Wᵢ
    end
    A.W .*= inv(sumAw)
    θₜ=Vector{Float64}[]
    agents=range(1,length=N)
    while(any(abs(x - A.θ[1])>0.01 for x in A.θ))
        A.t+=1
        push!(θₜ,copy(A.θ))
        sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
        @inbounds for i in A.upagents
            s=0.0
            c=0.0
            cedg_sum = 0.0
            for j in inneighbors(A.G, i)
                if rand()<Pᵢ && i!=j
                    cedg_j = exp(-abs(θₜ[A.t][i]-θₜ[A.t][j])/(1.0-A.C[i]))
                    cedg_sum += cedg_j
                    s += cedg_j * sinθ[j]
                    c += cedg_j * cosθ[j]
                end
            end
            cedg_i = A.C[i] * cedg_sum / (1.0-A.C[i])
            s += cedg_i * sinθ[i]
            c += cedg_i * cosθ[i]

            A.θ[i]=atan(s,c)
            α = 1 / sqrt(s * s + c * c)
            sinθ_next[i] = s * α
            cosθ_next[i] = c * α
        end
        copy!(sinθ, sinθ_next)
        copy!(cosθ, cosθ_next)
    end
    return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
end

(Notice that I added a fixed seed at the start of the function.)
On my computer the timing compared to your baseline main, with the same seed added, is

julia> @btime main($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
  287.836 ms (14576 allocations: 7.13 MiB)
(174, 1.5671239129649672, 1.5694010764401172)

julia> @btime main2($Cmin,$Cmax,$N,$n,$outdeg,$Pᵣ,$Pᵢ,$P,$μ,$σ)
  130.656 ms (14580 allocations: 7.17 MiB)
(174, 1.5671239129649661, 1.5694010764401172)

It should be added that, as noted earlier in the thread (around message 48), the loop we’re using here isn’t quite the right one, so the timings might turn out differently once that is fixed.

And yes, I do appreciate seeing new vectorization tricks.

2 Likes

Yes, I have been trying to port parts of this code to the GPU. However, coming up with alternate logic for some of the conditional statements inside the for loop is turning out to be difficult.

I tried running part of the code on the GPU, but it takes longer:

mutable struct Arr
    t::Int64
    θ::Array{Float64,1}
    C::Array{Float64,1}
    G::SimpleDiGraph{Int64}
    W::Array{Float64,1}
    upagents::Array{Int64,1}
    cedg::Array{Float64,1}
    cn::Array{Bool,1}
end

function update!(s::CuArray,c::CuArray,cedg::CuArray,θ::CuArray)
  s .= cedg.*sin.(θ)
  c .= cedg.*cos.(θ)
end

function main(Cmin::Float64,Cmax::Float64,N::Int64,n::Int64,outdeg::Int64,Pᵣ::Float64,Pᵢ::Float64,P::Int64,μ::Float64,σ::Float64)
    A=Arr(0,Array{Float64}(undef,N),Array{Float64}(undef,N),watts_strogatz(N,outdeg,Pᵣ,is_directed=true),
    Array{Float64}(undef,N),Array{Int64}(undef,round(Int,P)),Array{Float64}(undef,N),falses(N))
    rand!(Normal(μ,σ),A.θ)
    rand!(Uniform(Cmin,Cmax),A.C)
    A.W.=1.0.-A.C
    A.W./=sum(A.W)
    θₜ=Vector{Float64}[]
    agents=range(1,length=N)
    s=CuArrays.fill(0.0f0,N)
    c=CuArrays.fill(0.0f0,N)
    while(any(abs(x - A.θ[1])>0.01 for x in A.θ))
        A.t+=1
        push!(θₜ,copy(A.θ))
        sample!(Random.GLOBAL_RNG,agents,Weights(A.W),A.upagents,replace=false)
        @inbounds for h in A.upagents
            A.cedg.=0.0
            A.cn.=false
            for j in inneighbors(A.G,h)
                A.cn[j]=true
            end
            @inbounds for i in 1:N
                if((A.cn[i]==true)||(A.cn[i]==false && rand()<Pᵢ))
                    A.cedg[i]=exp(-abs(θₜ[A.t][h]-θₜ[A.t][i])/(1.0-A.C[h]))
                end
            end
            A.cedg[h]=0.0
            A.cedg.*=((1.0-A.C[h])/sum(A.cedg))
            A.cedg[h]=A.C[h]
            s.=0.0f0
            c.=0.0f0
            update!(s,c,cu(A.cedg),cu(θₜ[A.t]))
            A.θ[h]=atan(sum(Array(s)),sum(Array(c)))
        end
    end
    return A.t,A.θ[1],sum(A.C.*θₜ[1])/sum(A.C)
end

For the following values:

using BenchmarkTools
Cmin=0.5;Cmax=0.9;N=1000;n=5000;outdeg=Int(0.25*N);Pᵣ=0.0;Pᵢ=0.5;P=Int(0.5*N);μ=pi/2;σ=pi/18;
@btime main(Cmin,Cmax,N,n,outdeg,Pᵣ,Pᵢ,P,μ,σ)

Yields,

10.538 s (22077831 allocations: 2.92 GiB)
(187, 1.5737587213516235, 1.5750543941387851)

It takes excruciatingly long to run this code for N=10000.

Sending data back and forth between the CPU and the GPU is costly. Why not perform the sum on the GPU?
Also, if you’re going to run stuff on the GPU, I would guess you need to have the entire algorithm on the GPU to avoid excessive data transfer.
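
For example, a minimal sketch (hypothetical names; assumes cedg_d and θ_d are CuArrays already resident on the device) that keeps both the broadcast and the reduction on the GPU, so only two scalars come back per update:

# Hypothetical sketch: keep the data on the device and reduce there, instead
# of copying two length-N arrays back to the host every iteration.
s = sum(cedg_d .* sin.(θ_d))   # fused broadcast + reduction, all on the GPU
c = sum(cedg_d .* cos.(θ_d))
A.θ[h] = atan(s, c)            # only two scalars transferred back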

I tried having the inner for-loop (for h in upagents) on the GPU, but I couldn’t find alternatives to a few of the operations in there. I posted about it here: Rewriting function on CPU for execution on GPU