I would try replacing some of the divisions with multiplications like I did. In particular, replace:
A.cedg ./= (sum(A.cedg) / (1.0 - A.C[h]))
with
A.cedg .*= ((1.0-A.C[h]) / sum(A.cedg))
Multiplication is much faster than division.
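As a quick sanity check (a minimal sketch, not your actual code), you can microbenchmark the two forms of the scaling directly:

using BenchmarkTools
x = rand(1000); s = sum(x)
r = 1.0 / s        # precompute the reciprocal once
y = similar(x)
@btime $y .= $x ./ $s;  # one division per element
@btime $y .= $x .* $r;  # one multiplication per element; typically noticeably faster

The two forms can differ in the last bit or two, since x * (1/s) rounds differently than x / s, but that is usually acceptable.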
As for vectorization, I don’t believe the version you tried was vectorized.
You can try moving that loop to its own function and microbenchmarking it.
Here, to focus on vectorization, I delete the branches so we can more simply compare vectorized and unvectorized versions:
using BenchmarkTools, LoopVectorization, SLEEF
θ = randn(1000); c = randn(1000);
function sumsc_vectorized(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
    s, c = 0.0, 0.0
    @vvectorize for i ∈ eachindex(θ, coef)
        sinθᵢ, cosθᵢ = sincos(θ[i])
        s += coef[i] * sinθᵢ
        c += coef[i] * cosθᵢ
    end
    s, c
end
function sumsc_serial(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
    s, c = 0.0, 0.0
    @inbounds for i ∈ eachindex(θ, coef)
        sinθᵢ, cosθᵢ = sincos(θ[i])
        s += coef[i] * sinθᵢ
        c += coef[i] * cosθᵢ
    end
    s, c
end
function sumsc_sleef(θ::AbstractArray{Float64}, coef::AbstractArray{Float64})
    s, c = 0.0, 0.0
    @inbounds @simd for i ∈ eachindex(θ, coef)
        sinθᵢ, cosθᵢ = SLEEF.sincos_fast(θ[i])
        s += coef[i] * sinθᵢ
        c += coef[i] * cosθᵢ
    end
    s, c
end
@btime sumsc_serial($θ, $c)
@btime sumsc_sleef($θ, $c)
@btime sumsc_vectorized($θ, $c)
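(The $-interpolation in those calls makes BenchmarkTools treat θ and c as local values instead of untyped globals, so the timings reflect the loops themselves rather than dynamic-dispatch overhead.)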
This yielded
julia> @btime sumsc_serial($θ, $c)
15.532 μs (0 allocations: 0 bytes)
(-15.810428689848422, 21.21581166741435)
julia> @btime sumsc_sleef($θ, $c)
23.051 μs (0 allocations: 0 bytes)
(-15.810428689848425, 21.215811667414357)
julia> @btime sumsc_vectorized($θ, $c)
4.764 μs (0 allocations: 0 bytes)
(-15.810428689848425, 21.21581166741438)
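For scale (this is just my arithmetic on the numbers above, not a new measurement): 15.532 μs over 1000 elements is about 15.5 ns per iteration for the serial version, versus about 4.8 ns vectorized, a roughly 3.3x speedup. That is close to the ideal 4x you'd expect from 4-wide vectors, with the remainder presumably lost to the scalar tail and the final reduction.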
Just from these times, you can tell that SLEEF did not vectorize. But to be sure, you can look at the LLVM IR:
julia> @code_llvm debuginfo=:none sumsc_sleef(θ, c)

define void @julia_sumsc_sleef_19233([2 x double]* noalias nocapture sret, %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
  %3 = alloca %jl_value_t addrspace(10)*, i32 3
  %4 = alloca <2 x double>, align 16
  %tmpcast = bitcast <2 x double>* %4 to [2 x double]*
  %5 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %6 = bitcast %jl_value_t addrspace(11)* %5 to %jl_value_t addrspace(10)* addrspace(11)*
  %7 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %6, i64 3
  %8 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %7 to i64 addrspace(11)*
  %9 = load i64, i64 addrspace(11)* %8, align 8
  %10 = icmp sgt i64 %9, 0
  %11 = select i1 %10, i64 %9, i64 0
  %12 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
  %13 = bitcast %jl_value_t addrspace(11)* %12 to %jl_value_t addrspace(10)* addrspace(11)*
  %14 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %13, i64 3
  %15 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %14 to i64 addrspace(11)*
  %16 = load i64, i64 addrspace(11)* %15, align 8
  %17 = icmp sgt i64 %16, 0
  %18 = select i1 %17, i64 %16, i64 0
  %19 = icmp eq i64 %11, %18
  br i1 %19, label %L17, label %L13

L13:                                              ; preds = %top
  %20 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 0
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139815113578320 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %20
  %21 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 1
  store %jl_value_t addrspace(10)* %1, %jl_value_t addrspace(10)** %21
  %22 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 2
  store %jl_value_t addrspace(10)* %2, %jl_value_t addrspace(10)** %22
  %23 = call nonnull %jl_value_t addrspace(10)* @jl_invoke(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139815178141088 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %3, i32 3, %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139812628328400 to %jl_value_t*) to %jl_value_t addrspace(10)*))
  call void @llvm.trap()
  unreachable

L17:                                              ; preds = %top
  br i1 %10, label %L25.lr.ph, label %L52

L25.lr.ph:                                        ; preds = %L17
  %24 = bitcast %jl_value_t addrspace(11)* %5 to double addrspace(13)* addrspace(11)*
  %25 = bitcast %jl_value_t addrspace(11)* %12 to double addrspace(13)* addrspace(11)*
  br label %L25

L25:                                              ; preds = %L25.lr.ph, %L25
  %value_phi217 = phi i64 [ 0, %L25.lr.ph ], [ %38, %L25 ]
  %26 = phi <2 x double> [ zeroinitializer, %L25.lr.ph ], [ %37, %L25 ]
  %27 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %24, align 8
  %28 = getelementptr inbounds double, double addrspace(13)* %27, i64 %value_phi217
  %29 = load double, double addrspace(13)* %28, align 8
  call void @julia_sincos_fast_19234([2 x double]* noalias nocapture nonnull sret %tmpcast, double %29)
  %30 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %25, align 8
  %31 = getelementptr inbounds double, double addrspace(13)* %30, i64 %value_phi217
  %32 = load double, double addrspace(13)* %31, align 8
  %33 = load <2 x double>, <2 x double>* %4, align 16
  %34 = insertelement <2 x double> undef, double %32, i32 0
  %35 = shufflevector <2 x double> %34, <2 x double> undef, <2 x i32> zeroinitializer
  %36 = fmul contract <2 x double> %35, %33
  %37 = fadd fast <2 x double> %26, %36
  %38 = add nuw nsw i64 %value_phi217, 1
  %39 = icmp ult i64 %38, %11
  br i1 %39, label %L25, label %L52

L52:                                              ; preds = %L25, %L17
  %40 = phi <2 x double> [ zeroinitializer, %L17 ], [ %37, %L25 ]
  %41 = bitcast [2 x double]* %0 to <2 x double>*
  store <2 x double> %40, <2 x double>* %41, align 8
  ret void
}
Versus the vectorized version:
julia> @code_llvm debuginfo=:none sumsc_vectorized(θ, c)

define void @julia_sumsc_vectorized_18810([2 x double]* noalias nocapture sret, %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
  %3 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %4 = bitcast %jl_value_t addrspace(11)* %3 to %jl_array_t addrspace(11)*
  %5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
  %6 = load i64, i64 addrspace(11)* %5, align 8
  %7 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
  %8 = bitcast %jl_value_t addrspace(11)* %7 to %jl_array_t addrspace(11)*
  %9 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %8, i64 0, i32 1
  %10 = load i64, i64 addrspace(11)* %9, align 8
  %11 = icmp slt i64 %10, %6
  %12 = select i1 %11, i64 %10, i64 %6
  %13 = lshr i64 %12, 2
  %14 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
  %15 = bitcast %jl_value_t* %14 to i64*
  %16 = load i64, i64* %15, align 8
  %17 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
  %18 = bitcast %jl_value_t* %17 to i64*
  %19 = load i64, i64* %18, align 8
  %20 = icmp eq i64 %13, 0
  br i1 %20, label %L223, label %L22.preheader

L22.preheader:                                    ; preds = %top
  %21 = inttoptr i64 %19 to i8*
  %22 = inttoptr i64 %16 to i8*
  br label %L22

L22:                                              ; preds = %L22.preheader, %L22
  %value_phi2 = phi <4 x double> [ %res.i96, %L22 ], [ zeroinitializer, %L22.preheader ]
  %value_phi3 = phi <4 x double> [ %res.i94, %L22 ], [ zeroinitializer, %L22.preheader ]
  %value_phi4 = phi i64 [ %39, %L22 ], [ 0, %L22.preheader ]
  %value_phi5 = phi i64 [ %41, %L22 ], [ 1, %L22.preheader ]
  %23 = shl i64 %value_phi4, 3
  %24 = getelementptr i8, i8* %21, i64 %23
  %ptr.i = bitcast i8* %24 to <4 x double>*
  %res.i = load <4 x double>, <4 x double>* %ptr.i, align 8
  %res.i153 = fmul fast <4 x double> %res.i, <double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883>
  %res.i152 = call <4 x double> @llvm.trunc.v4f64(<4 x double> %res.i153)
  %res.i151 = fmul fast <4 x double> %res.i, <double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883>
  %res.i150 = fmul fast <4 x double> %res.i152, <double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000>
  %res.i149 = fsub fast <4 x double> %res.i151, %res.i150
  %res.i148 = call <4 x double> @llvm.rint.v4f64(<4 x double> %res.i149)
  %res.i147 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i152, <4 x double> <double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000>, <4 x double> %res.i)
  %res.i146 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i148, <4 x double> <double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000>, <4 x double> %res.i147)
  %res.i145 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i152, <4 x double> <double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000>, <4 x double> %res.i146)
  %res.i144 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i148, <4 x double> <double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000>, <4 x double> %res.i145)
  %res.i143 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i152, <4 x double> <double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000>, <4 x double> %res.i144)
  %res.i142 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i148, <4 x double> <double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000>, <4 x double> %res.i143)
  %res.i140 = fadd fast <4 x double> %res.i148, %res.i150
  %res.i139 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i140, <4 x double> <double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A>, <4 x double> %res.i142)
  %res.i138 = fmul fast <4 x double> %res.i139, %res.i139
  %res.i137 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> <double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B>, <4 x double> <double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8>)
  %res.i136 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i137, <4 x double> <double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C>)
  %res.i135 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i136, <4 x double> <double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A>)
  %res.i134 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i135, <4 x double> <double 0x3F8111111110F135, double 0x3F8111111110F135, double 0x3F8111111110F135, double 0x3F8111111110F135>)
  %res.i133 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i134, <4 x double> <double 0xBFC5555555555542, double 0xBFC5555555555542, double 0xBFC5555555555542, double 0xBFC5555555555542>)
  %res.i132 = fmul fast <4 x double> %res.i138, %res.i139
  %res.i131 = fmul fast <4 x double> %res.i132, %res.i133
  %res.i130 = fadd fast <4 x double> %res.i131, %res.i139
  %res.i129 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> <double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE>, <4 x double> <double 0x3E21EEA016409F05, double 0x3E21EEA016409F05, double 0x3E21EEA016409F05, double 0x3E21EEA016409F05>)
  %res.i128 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i129, <4 x double> <double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C>)
  %res.i127 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i128, <4 x double> <double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025>)
  %res.i126 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i127, <4 x double> <double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96>)
  %res.i125 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i126, <4 x double> <double 0x3FA5555555555545, double 0x3FA5555555555545, double 0x3FA5555555555545, double 0x3FA5555555555545>)
  %res.i124 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i138, <4 x double> %res.i125, <4 x double> <double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01>)
  %res.i123 = fmul fast <4 x double> %res.i124, %res.i138
  %res.i122 = fadd fast <4 x double> %res.i123, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  %25 = extractelement <4 x double> %res.i148, i32 0
  %26 = fptosi double %25 to i64
  %27 = extractelement <4 x double> %res.i148, i32 1
  %28 = fptosi double %27 to i64
  %29 = extractelement <4 x double> %res.i148, i32 2
  %30 = fptosi double %29 to i64
  %31 = extractelement <4 x double> %res.i148, i32 3
  %32 = fptosi double %31 to i64
  %33 = insertelement <4 x i64> undef, i64 %26, i32 0
  %34 = insertelement <4 x i64> %33, i64 %28, i32 1
  %35 = insertelement <4 x i64> %34, i64 %30, i32 2
  %36 = insertelement <4 x i64> %35, i64 %32, i32 3
  %37 = trunc <4 x i64> %36 to <4 x i1>
  %res.i116 = select <4 x i1> %37, <4 x double> %res.i130, <4 x double> %res.i122
  %res.i114 = select <4 x i1> %37, <4 x double> %res.i122, <4 x double> %res.i130
  %res.i112 = and <4 x i64> %36, <i64 2, i64 2, i64 2, i64 2>
  %res.i110 = icmp eq <4 x i64> %res.i112, zeroinitializer
  %res.i109 = fsub fast <4 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i114
  %res.i108 = select <4 x i1> %res.i110, <4 x double> %res.i114, <4 x double> %res.i109
  %res.i106 = add nsw <4 x i64> %36, <i64 1, i64 1, i64 1, i64 1>
  %res.i105 = and <4 x i64> %res.i106, <i64 2, i64 2, i64 2, i64 2>
  %res.i103 = icmp eq <4 x i64> %res.i105, zeroinitializer
  %res.i102 = fsub fast <4 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i116
  %res.i101 = select <4 x i1> %res.i103, <4 x double> %res.i116, <4 x double> %res.i102
  %38 = getelementptr i8, i8* %22, i64 %23
  %ptr.i98 = bitcast i8* %38 to <4 x double>*
  %res.i99 = load <4 x double>, <4 x double>* %ptr.i98, align 8
  %res.i97 = fmul fast <4 x double> %res.i108, %res.i99
  %res.i96 = fadd fast <4 x double> %res.i97, %value_phi2
  %res.i95 = fmul fast <4 x double> %res.i101, %res.i99
  %res.i94 = fadd fast <4 x double> %res.i95, %value_phi3
  %39 = add nuw i64 %value_phi4, 4
  %40 = icmp eq i64 %value_phi5, %13
  %41 = add nuw nsw i64 %value_phi5, 1
  br i1 %40, label %L223.loopexit, label %L22

L223.loopexit:                                    ; preds = %L22
  %42 = shl i64 %12, 3
  %phitmp = and i64 %42, -32
  br label %L223

L223:                                             ; preds = %L223.loopexit, %top
  %value_phi8 = phi <4 x double> [ zeroinitializer, %top ], [ %res.i96, %L223.loopexit ]
  %value_phi9 = phi <4 x double> [ zeroinitializer, %top ], [ %res.i94, %L223.loopexit ]
  %value_phi10 = phi i64 [ 0, %top ], [ %phitmp, %L223.loopexit ]
  %43 = and i64 %12, 3
  %44 = icmp eq i64 %43, 0
  br i1 %44, label %L427, label %L228

L228:                                             ; preds = %L223
  %45 = trunc i64 %43 to i8
  %notmask = shl nsw i8 -1, %45
  %46 = xor i8 %notmask, 15
  %47 = inttoptr i64 %19 to i8*
  %48 = getelementptr i8, i8* %47, i64 %value_phi10
  %ptr.i90 = bitcast i8* %48 to <4 x double>*
  %masktrunc.i91 = trunc i8 %46 to i4
  %mask.i92 = bitcast i4 %masktrunc.i91 to <4 x i1>
  %res.i93 = call <4 x double> @llvm.masked.load.v4f64.p0v4f64(<4 x double>* %ptr.i90, i32 8, <4 x i1> %mask.i92, <4 x double> zeroinitializer)
  %res.i89 = fmul fast <4 x double> %res.i93, <double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883>
  %res.i88 = call <4 x double> @llvm.trunc.v4f64(<4 x double> %res.i89)
  %res.i87 = fmul fast <4 x double> %res.i93, <double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883, double 0x3FE45F306DC9C883>
  %res.i86 = fmul fast <4 x double> %res.i88, <double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000>
  %res.i85 = fsub fast <4 x double> %res.i87, %res.i86
  %res.i84 = call <4 x double> @llvm.rint.v4f64(<4 x double> %res.i85)
  %res.i83 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i88, <4 x double> <double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000>, <4 x double> %res.i93)
  %res.i82 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i84, <4 x double> <double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000>, <4 x double> %res.i83)
  %res.i81 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i88, <4 x double> <double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000>, <4 x double> %res.i82)
  %res.i80 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i84, <4 x double> <double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000>, <4 x double> %res.i81)
  %res.i79 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i88, <4 x double> <double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000>, <4 x double> %res.i80)
  %res.i78 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i84, <4 x double> <double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000>, <4 x double> %res.i79)
  %res.i76 = fadd fast <4 x double> %res.i84, %res.i86
  %res.i75 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i76, <4 x double> <double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A>, <4 x double> %res.i78)
  %res.i74 = fmul fast <4 x double> %res.i75, %res.i75
  %res.i73 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> <double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B, double 0x3DE5D82500BECB6B>, <4 x double> <double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8, double 0xBE5AE5E1E6F6F6D8>)
  %res.i72 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i73, <4 x double> <double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C, double 0x3EC71DE3503EAE9C>)
  %res.i71 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i72, <4 x double> <double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A, double 0xBF2A01A019B64F6A>)
  %res.i70 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i71, <4 x double> <double 0x3F8111111110F135, double 0x3F8111111110F135, double 0x3F8111111110F135, double 0x3F8111111110F135>)
  %res.i69 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i70, <4 x double> <double 0xBFC5555555555542, double 0xBFC5555555555542, double 0xBFC5555555555542, double 0xBFC5555555555542>)
  %res.i68 = fmul fast <4 x double> %res.i74, %res.i75
  %res.i67 = fmul fast <4 x double> %res.i68, %res.i69
  %res.i66 = fadd fast <4 x double> %res.i67, %res.i75
  %res.i65 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> <double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE, double 0xBDA8FBF9C1BDB8CE>, <4 x double> <double 0x3E21EEA016409F05, double 0x3E21EEA016409F05, double 0x3E21EEA016409F05, double 0x3E21EEA016409F05>)
  %res.i64 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i65, <4 x double> <double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C, double 0xBE927E4F8130BE9C>)
  %res.i63 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i64, <4 x double> <double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025, double 0x3EFA01A019C8F025>)
  %res.i62 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i63, <4 x double> <double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96, double 0xBF56C16C16C14C96>)
  %res.i61 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i62, <4 x double> <double 0x3FA5555555555545, double 0x3FA5555555555545, double 0x3FA5555555555545, double 0x3FA5555555555545>)
  %res.i60 = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %res.i74, <4 x double> %res.i61, <4 x double> <double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01>)
  %res.i59 = fmul fast <4 x double> %res.i60, %res.i74
  %res.i58 = fadd fast <4 x double> %res.i59, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  %49 = extractelement <4 x double> %res.i84, i32 0
  %50 = fptosi double %49 to i64
  %51 = extractelement <4 x double> %res.i84, i32 1
  %52 = fptosi double %51 to i64
  %53 = extractelement <4 x double> %res.i84, i32 2
  %54 = fptosi double %53 to i64
  %55 = extractelement <4 x double> %res.i84, i32 3
  %56 = fptosi double %55 to i64
  %57 = insertelement <4 x i64> undef, i64 %50, i32 0
  %58 = insertelement <4 x i64> %57, i64 %52, i32 1
  %59 = insertelement <4 x i64> %58, i64 %54, i32 2
  %60 = insertelement <4 x i64> %59, i64 %56, i32 3
  %61 = trunc <4 x i64> %60 to <4 x i1>
  %res.i52 = select <4 x i1> %61, <4 x double> %res.i66, <4 x double> %res.i58
  %res.i50 = select <4 x i1> %61, <4 x double> %res.i58, <4 x double> %res.i66
  %res.i48 = and <4 x i64> %60, <i64 2, i64 2, i64 2, i64 2>
  %res.i46 = icmp eq <4 x i64> %res.i48, zeroinitializer
  %res.i45 = fsub fast <4 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i50
  %res.i44 = select <4 x i1> %res.i46, <4 x double> %res.i50, <4 x double> %res.i45
  %res.i42 = add nsw <4 x i64> %60, <i64 1, i64 1, i64 1, i64 1>
  %res.i41 = and <4 x i64> %res.i42, <i64 2, i64 2, i64 2, i64 2>
  %res.i40 = icmp eq <4 x i64> %res.i41, zeroinitializer
  %res.i39 = fsub fast <4 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i52
  %res.i38 = select <4 x i1> %res.i40, <4 x double> %res.i52, <4 x double> %res.i39
  %62 = inttoptr i64 %16 to i8*
  %63 = getelementptr i8, i8* %62, i64 %value_phi10
  %ptr.i35 = bitcast i8* %63 to <4 x double>*
  %res.i36 = call <4 x double> @llvm.masked.load.v4f64.p0v4f64(<4 x double>* %ptr.i35, i32 8, <4 x i1> %mask.i92, <4 x double> zeroinitializer)
  %res.i34 = fmul fast <4 x double> %res.i44, %res.i36
  %res.i33 = fadd fast <4 x double> %res.i34, %value_phi8
  %res.i32 = select <4 x i1> %mask.i92, <4 x double> %res.i33, <4 x double> %value_phi8
  %res.i29 = fmul fast <4 x double> %res.i38, %res.i36
  %res.i28 = fadd fast <4 x double> %res.i29, %value_phi9
  %res.i27 = select <4 x i1> %mask.i92, <4 x double> %res.i28, <4 x double> %value_phi9
  br label %L427

L427:                                             ; preds = %L223, %L228
  %value_phi11 = phi <4 x double> [ %res.i32, %L228 ], [ %value_phi8, %L223 ]
  %value_phi12 = phi <4 x double> [ %res.i27, %L228 ], [ %value_phi9, %L223 ]
  %vec_2_1.i20 = shufflevector <4 x double> %value_phi12, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i21 = shufflevector <4 x double> %value_phi12, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i22 = fadd <2 x double> %vec_2_1.i20, %vec_2_2.i21
  %vec_1_1.i23 = shufflevector <2 x double> %vec_2.i22, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i24 = shufflevector <2 x double> %vec_2.i22, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i25 = fadd <1 x double> %vec_1_1.i23, %vec_1_2.i24
  %res.i26 = extractelement <1 x double> %vec_1.i25, i32 0
  %vec_2_1.i = shufflevector <4 x double> %value_phi11, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i = shufflevector <4 x double> %value_phi11, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i = fadd <2 x double> %vec_2_1.i, %vec_2_2.i
  %vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i = fadd <1 x double> %vec_1_1.i, %vec_1_2.i
  %res.i19 = extractelement <1 x double> %vec_1.i, i32 0
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 0
  store double %res.i19, double* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx14 = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 1
  store double %res.i26, double* %.sroa.2.0..sroa_idx14, align 8
  ret void
}
The key difference to note here is that the vectorized version is full of <4 x double> vectors, as it is calculating 4 loop iterations at a time. The SLEEF version is calculating 1 loop iteration at a time (the [2 x double] that you see holds the sine and cosine).
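If you don't want to read the IR every time, a rough shortcut (my own helper, not part of LoopVectorization or SLEEF) is to dump the IR to a buffer and search it for wide vector types:

using InteractiveUtils  # provides code_llvm

# Hypothetical helper: true if the compiled IR contains 4-wide or wider
# double vectors. We deliberately skip <2 x double>, since the SLEEF
# version also contains one (the packed sin/cos return value).
function has_wide_vectors(f, args...)
    io = IOBuffer()
    code_llvm(io, f, Tuple{map(typeof, args)...}; debuginfo=:none)
    occursin(r"<(4|8|16) x double>", String(take!(io)))
end

has_wide_vectors(sumsc_vectorized, θ, c)  # true (on this AVX2 machine)
has_wide_vectors(sumsc_sleef, θ, c)       # false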
Note that this vectorized version is incompatible with the if cedg[l] != 0 check, meaning you'd have to remove that check to use it. I already benchmarked that, and found that dropping the check was much slower than the serial version that keeps it.
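For reference, here is a sketch of the kind of branchy loop I mean (the name cedg follows your post; this is my reconstruction, not your exact code):

function sumsc_branchy(θ::AbstractVector{Float64}, cedg::AbstractVector{Float64})
    s, c = 0.0, 0.0
    @inbounds for l in eachindex(θ, cedg)
        cedg[l] == 0 && continue     # this early-out is what blocks SIMD:
        sinθₗ, cosθₗ = sincos(θ[l])  # iterations no longer do uniform work
        s += cedg[l] * sinθₗ
        c += cedg[l] * cosθₗ
    end
    s, c
end

If most entries of cedg are zero, the branch skips most of the expensive sincos calls, which is exactly why the serial version with the check can beat the branch-free vectorized one.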
Also, note that LoopVectorization is actually using basically the exact same code as SLEEF. The problem with SLEEF is that it does not inline the sincos_fast call, so the loop is stuck making a series of independent, one-at-a-time calls to sincos_fast; you can see this in the SLEEF IR above, where the loop body contains an opaque call void @julia_sincos_fast_19234(...) that the compiler cannot widen across iterations. Otherwise, they're actually running the same implementation.