Can this be written even faster? (CPU)

Hello!

My usual experience is that whenever I think something is as fast as it can be, someone shows me how to do it even faster :slight_smile:

I’ve attached a script below with a serial version and several ways of multi-threading it. The results I get from the serial, threaded, etc. versions are as follows:

serial: Trial(953.400 μs)
@threads: Trial(254.000 μs)
@threads :static: Trial(248.900 μs)
Polyester.@batch: Trial(239.600 μs)

I know Polyester is recommended (when it works) because it reduces allocations, but I was wondering whether there is an even faster way to calculate this. For me 240 μs is slow: I need to call this function a lot of times, so any reduction in time is worthwhile.

Kind regards

using BenchmarkTools
using Polyester
using StaticArrays
using LoopVectorization

# Generate the needed constants
N     = 6000
Cb    = 100000
xᵢⱼ   = rand(SVector{3,Float64},N*4)
list  = [(rand(1:N), rand(1:N), rand()) for _ in 1:N*4]
ρ₀    = 1000
γ⁻¹   = 1/7
g     = 9.81
drhopLp = zeros(length(list))
drhopLn = zeros(length(list))


# Serial baseline for the heavy part of the density-diffusion calculation
function serial_approach(Cb,xᵢⱼ,list,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    for iter in eachindex(list)
        xⱼᵢ   = -xᵢⱼ[iter]
        Pᵢⱼᴴ  = ρ₀ * (-g) * xⱼᵢ[2]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ/Cb), γ⁻¹) - 1)
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ/Cb), γ⁻¹) - 1)
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end


# Multi-threaded version of the heavy part of the density-diffusion calculation
function threads_approach(Cb,xᵢⱼ,list,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    Base.Threads.@threads for iter in eachindex(list)
        xⱼᵢ   = -xᵢⱼ[iter]
        Pᵢⱼᴴ  = ρ₀ * (-g) * xⱼᵢ[2]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ/Cb), γ⁻¹) - 1)
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ/Cb), γ⁻¹) - 1)
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

function threads_static_approach(Cb,xᵢⱼ,list,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    Base.Threads.@threads :static for iter in eachindex(list)
        xⱼᵢ   = -xᵢⱼ[iter]
        Pᵢⱼᴴ  = ρ₀ * (-g) * xⱼᵢ[2]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ/Cb), γ⁻¹) - 1)
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ/Cb), γ⁻¹) - 1)
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

function polyester_approach(Cb,xᵢⱼ,list,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    @batch for iter in eachindex(list)
        xⱼᵢ   = -xᵢⱼ[iter]
        Pᵢⱼᴴ  = ρ₀ * (-g) * xⱼᵢ[2]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ/Cb), γ⁻¹) - 1)
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ/Cb), γ⁻¹) - 1)
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end


function loopvectorization_approach(Cb,xᵢⱼ,list,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(list)
        xⱼᵢ   = -xᵢⱼ[iter]
        Pᵢⱼᴴ  = ρ₀ * (-g) * xⱼᵢ[2]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ/Cb), γ⁻¹) - 1)
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ/Cb), γ⁻¹) - 1)
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

s  = @benchmark serial_approach($Cb,$xᵢⱼ,$list,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn);
t  = @benchmark threads_approach($Cb,$xᵢⱼ,$list,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn);
ts = @benchmark threads_static_approach($Cb,$xᵢⱼ,$list,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn);
p  = @benchmark polyester_approach($Cb,$xᵢⱼ,$list,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn);
# why doesn't @tturbo work?
#l  = @benchmark loopvectorization_approach($Cb,$xᵢⱼ,$list,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn);


println(s,t,ts,p)

Maybe try to put const in front of those global variables?
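
A minimal sketch of that suggestion (it only matters if the globals are read directly from inside a function; the benchmarks below pass them as arguments):

const N  = 6000
const Cb = 100_000
const g  = 9.81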

It appears that he is passing all the globals to the functions, so the functions do not access the globals and wouldn’t incur the performance degradation due to such access, but your suggestion is good practice anyway.

2 Likes

You’re right. Sorry I didn’t notice that.

1 Like

Is Float32 precision sufficient? If so, that will be a free 2-3x boost.
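
For reference, a hedged sketch of what the single-precision setup could look like, assuming the reduced precision is acceptable for the physics. The *32 names are made up here; the point is that inputs, outputs and constants all stay Float32 so the loop never promotes to Float64:

xᵢⱼ32     = rand(SVector{3,Float32}, N * 4)
Cb32      = Float32(100_000)
ρ₀32      = 1000.0f0
γ⁻¹32     = 1.0f0 / 7
g32       = 9.81f0
drhopLp32 = zeros(Float32, N * 4)
drhopLn32 = zeros(Float32, N * 4)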

1 Like

Your seventh roots seem to be the only operation that could take long. Do you know anything a priori about the arguments? Could you tabulate the values? Could you use an approximation?

Maybe you are just hitting memory-bandwidth limits and need to rethink the problem at a higher level, so that you can do more calculation on smaller parts of the array. Oscar’s suggestion addresses memory bandwidth by decreasing the size of the elements.
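
To make the tabulation idea concrete, here is a hedged sketch. The helper names, grid bounds and resolution are placeholders, and the accuracy would have to be checked against the exact formula before using anything like this in the real simulation:

# Precompute ρ(P) on a grid covering the expected pressure range,
# then replace the seventh root in the loop with a linear interpolation.
function build_table(Cb, ρ₀, γ⁻¹, Pmin, Pmax, n)
    Ps = range(Pmin, Pmax; length = n)
    ρs = [ρ₀ * ((1 + P / Cb)^γ⁻¹ - 1) for P in Ps]
    return Ps, ρs
end

@inline function lookup(Ps, ρs, P)
    t = (P - first(Ps)) / step(Ps)              # fractional grid index
    i = clamp(floor(Int, t), 0, length(Ps) - 2) # lower grid cell
    w = t - i                                   # interpolation weight
    return (1 - w) * ρs[i + 1] + w * ρs[i + 2]
end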

I would prefer to stay in Float64 for now :blush:

The constants Cb, rho0 and gamma I know beforehand. Unfortunately an approximation is not feasible AFAIK - at least I want to avoid that for now.

I’ve been trying to get a @tturbo version to work but unfortunately it allocates for some reason:

using BenchmarkTools
using Polyester
using StaticArrays
using LoopVectorization

# Generate the needed constants
N     = 6000
Cb    = 100000
xᵢⱼᶻ  = rand(Float64,N*4)
xᵢⱼ   = rand(SVector{3,Float64},N*4)
list  = [(rand(1:N), rand(1:N), rand()) for _ in 1:N*4]
ρ₀    = 1000
γ⁻¹   = 1/7
g     = 9.81
drhopLp = zeros(length(list))
drhopLn = zeros(length(list))

function loopvectorization_approach(Cb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
        # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
        Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ/Cb), γ⁻¹) - 1)
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ/Cb), γ⁻¹) - 1)
        
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

l  = @benchmark loopvectorization_approach($Cb,$xᵢⱼᶻ,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn);

I’ve introduced a new vector since I couldn’t double-index into a StaticArray using LoopVectorization. With the part mentioned in the comment removed there are no allocations, but with it kept in there are a lot of allocations:

BenchmarkTools.Trial: 1371 samples with 1 evaluation.
 Range (min … max):  1.542 ms … 64.285 ms  ┊ GC (min … max):  0.00% … 41.45%
 Time  (median):     2.444 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.648 ms ±  5.515 ms  ┊ GC (mean ± σ):  29.74% ± 18.84%

  ▆██▅▃▃
  ██████▆▅▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆██▅▆▆▆▆▅ █
  1.54 ms      Histogram: log(frequency) by time     26.8 ms <

 Memory estimate: 8.24 MiB, allocs estimate: 132000.

BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.467 μs … 11.067 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.600 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.986 μs ±  1.229 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▂  ▁▁▂▁                                               ▄ ▂
  ██████████▇▇█▇▇▆▇▄▅▅▄▄▄▄▁▃▁▁▄▃▃▄▁▃▄▁▃▃▁▄▁▁▃▁▁▁▁▁▁▄▃▁▁▁▁▇▄█ █
  2.47 μs      Histogram: log(frequency) by time     7.18 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Would anyone happen to know how to deal with that?

Kind regards

Could the allocations you’re seeing be related to this?

You can try converting ρ₀ = 1000 to a float before you compute the inner sum. The Julia compiler seems to give up more easily on inferring types in Julia 1.10.

That will still infer fine. That said, it will be faster to work with inv(ρ₀) since multiplication is faster than division.
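
A minimal sketch of the reciprocal trick: divide once outside the hot loop and multiply inside it. The helper name eos is made up here, not from the original code (the thread below ends up applying the same idea to Cb):

eos(P, ρ₀, invC, γ⁻¹) = ρ₀ * ((1 + P * invC)^γ⁻¹ - 1)

invCb = inv(100_000.0)            # hoisted reciprocal, computed once
eos(500.0, 1000.0, invCb, 1 / 7)  # multiply in the loop body instead of dividing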

1 Like

Updated the code to strictly use Float64:

using BenchmarkTools
using StaticArrays
using LoopVectorization

# Generate the needed constants
N     = 6000
Cb    = Float64(100000)
xᵢⱼᶻ  = rand(Float64,N*4)
xᵢⱼ   = rand(SVector{3,Float64},N*4)
ρ₀    = Float64(1000)
γ⁻¹   = Float64(1/7)
g     = Float64(9.81)
drhopLp = zeros(Float64, length(xᵢⱼ))
drhopLn = zeros(Float64, length(xᵢⱼ))

function loopvectorization_approach(Cb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
        # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
        Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ/Cb), γ⁻¹) - 1)
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ/Cb), γ⁻¹) - 1)
        
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

l  = @benchmark loopvectorization_approach($Cb,$xᵢⱼᶻ,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn);

display(l)

BenchmarkTools.Trial: 1246 samples with 1 evaluation.
 Range (min … max):  1.339 ms … 59.475 ms  ┊ GC (min … max):  0.00% … 36.11%
 Time  (median):     2.644 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   4.011 ms ±  5.803 ms  ┊ GC (mean ± σ):  28.21% ± 18.78%

  ▂█▇▇▄▂
  ██████▁▄▁▄▁▄▁▁▁▁▅▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▁▄▅▇▇▆▆▇▄▅▄▄▅▁▄▁▁▄▄▄▄▅ █
  1.34 ms      Histogram: log(frequency) by time     32.9 ms <

 Memory estimate: 8.24 MiB, allocs estimate: 132000.

The issue with allocations persists.

Sorry, I didn’t get this: you meant Cb, right?

Kind regards

LoopVectorization often allocates on Julia 1.10.
Can you try 1.9?

I haven’t looked at this.

2 Likes

The difference between multiplying by inv(Cb) and dividing by Cb is 0.040 ms, nice!

Thank you :blush: In my case this does add up.

2 Likes

@Elrod, I would never have thought of this myself :sweat_smile:

Using Julia 1.9.4:

using BenchmarkTools
using StaticArrays
using LoopVectorization

# Generate the needed constants
N     = 6000
Cb    = inv(Float64(100000))
xᵢⱼᶻ  = rand(Float64,N*4)
xᵢⱼ   = rand(SVector{3,Float64},N*4)
ρ₀    = Float64(1000)
γ⁻¹   = Float64(1/7)
g     = Float64(9.81)
drhopLp = zeros(Float64, length(xᵢⱼ))
drhopLn = zeros(Float64, length(xᵢⱼ))

function loopvectorization_approach(Cb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
        # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
        Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ * Cb), γ⁻¹) - 1)
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ * Cb), γ⁻¹) - 1)
        
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

l  = @benchmark loopvectorization_approach($Cb,$xᵢⱼᶻ,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn);

display(l)
julia> display(l)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  163.000 μs … 556.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     172.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   178.892 μs ±  27.539 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▆▆▂█▅▄▇▄▃▂▂▃▁▃▁ ▁▁▃ ▂▂                                      ▂
  ▆▁███████████████▇███████▆▆▆▆▆▇▆▅▆▅▅▅▆▅▆▃▄▄▄▅▃▄▃▆▄▄▅▇▆▆▆▅▆▅▆▅ █
  163 μs        Histogram: log(frequency) by time        266 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

I would love it if this could work on version 1.10, but I suppose my best option for now is to go back to 1.9? :confused:

EDIT: The allocations kill my performance since I am doing particle simulation stuff.

Kind regards

1 Like

Can you verify with code_warntype whether the allocations come from a type instability in the LoopVectorization version?
If yes, you might still make it work in 1.10 by converting some of the other integers to floats explicitly.
It’s a bit unjulian, but if it works…
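
A minimal sketch of that suggestion, assuming the same function signature as above. The only change is that the integer literals 1 in the loop body become 1.0, so every intermediate stays Float64:

using LoopVectorization

function loopvectorization_float_literals(Cb, xᵢⱼᶻ, ρ₀, γ⁻¹, g, drhopLp, drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
        Pᵢⱼᴴ = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
        ρᵢⱼᴴ = ρ₀ * ((1.0 + Pᵢⱼᴴ * Cb)^γ⁻¹ - 1.0)   # 1.0 instead of 1
        Pⱼᵢᴴ = -Pᵢⱼᴴ
        ρⱼᵢᴴ = ρ₀ * ((1.0 + Pⱼᵢᴴ * Cb)^γ⁻¹ - 1.0)
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end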

code_warntype

MethodInstance for loopvectorization_approach(::Float64, ::Vector{Float64}, ::Float64, ::Float64, ::Float64, ::Vector{Float64}, ::Vector{Float64})
  from loopvectorization_approach(Cb, xᵢⱼᶻ, ρ₀, γ⁻¹, g, drhopLp, drhopLn) @ Main REPL[33]:1
Arguments
  #self#::Core.Const(loopvectorization_approach)
  Cb::Float64
  xᵢⱼᶻ::Vector{Float64}
  ρ₀::Float64
  γ⁻¹::Float64
  g::Float64
  drhopLp::Vector{Float64}
  drhopLn::Vector{Float64}
Locals
  @_9::Union{}
  val@_10::Union{}
  @_11::Int64
  @_12::Int64
  @_13::Int64
  val@_14::Base.OneTo{Int64}
  Wvecwidth##::Static.StaticInt{4}
  Tloopeltype##::Type{Float64}
  vargsym#233::Tuple{Tuple{CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1,), (1,), (1,)), ((1,), (2,), (3,)), Tuple{Static.StaticInt{8}, Static.StaticInt{8}, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{0}}}, Vararg{Float64, 4}}}
  ##grouped#strided#pointer####19###::LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1,), (1,), (1,)), ((1,), (2,), (3,)), Tuple{Static.StaticInt{8}, Static.StaticInt{8}, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{0}}}
  preserve#buffer#@_19::Vector{Float64}
  vptr##_drhopLn::LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}
  preserve#buffer#@_21::Vector{Float64}
  vptr##_drhopLp::LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}
  #loopconstnumber###12###::Int64
  preserve#buffer#@_24::Vector{Float64}
  vptr##_xᵢⱼᶻ::LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}
  #temp###5###::Float64
  #temp###6###::Float64
  #iter_loop_upper_bound###4###::Int64
  #iter_loop_lower_bound###3###::Static.StaticInt{1}
  #loopleniter###2###::Int64
  #looprangeiter###1###::Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}
  msg::Union{}
  kwargs::Union{}
  line::Union{}
  file::Union{}
  id::Union{}
  logger::Union{}
  _module::Union{}
  group::Union{}
  std_level::Union{}
  level::Union{}
  err::Union{}
  iter::Union{}
  ρⱼᵢᴴ::Union{}
  Pⱼᵢᴴ::Union{}
  ρᵢⱼᴴ::Union{}
  Pᵢⱼᴴ::Union{}
  @_48::Union{}
Body::Nothing
1 ──        Core.NewvarNode(:(@_9))
│           Core.NewvarNode(:(val@_10))
│           Core.NewvarNode(:(@_11))
│           Core.NewvarNode(:(@_12))
│           Core.NewvarNode(:(@_13))
│           Core.NewvarNode(:(Wvecwidth##))
│           Core.NewvarNode(:(Tloopeltype##))
│           Core.NewvarNode(:(vargsym#233))
│           Core.NewvarNode(:(##grouped#strided#pointer####19###))
│           Core.NewvarNode(:(preserve#buffer#@_19))
│           Core.NewvarNode(:(vptr##_drhopLn))
│           Core.NewvarNode(:(preserve#buffer#@_21))
│           Core.NewvarNode(:(vptr##_drhopLp))
│           Core.NewvarNode(:(#loopconstnumber###12###))
│           Core.NewvarNode(:(preserve#buffer#@_24))
│           Core.NewvarNode(:(vptr##_xᵢⱼᶻ))
│           Core.NewvarNode(:(#temp###5###))
│           Core.NewvarNode(:(#temp###6###))
│           Main.nothing
│           Main.nothing
│           Main.nothing
│           Main.nothing
│           Main.nothing
│           Main.nothing
│           Main.nothing
│           Main.nothing
│           nothing
│           (val@_14 = Main.eachindex(xᵢⱼᶻ))
│           nothing
│    %30  = val@_14::Base.OneTo{Int64}
│           (#looprangeiter###1### = LoopVectorization.canonicalize_range(%30))
│           (#loopleniter###2### = StaticArrayInterface.static_length(#looprangeiter###1###))
│           (#iter_loop_lower_bound###3### = LoopVectorization.maybestaticfirst(#looprangeiter###1###))
│           (#iter_loop_upper_bound###4### = LoopVectorization.maybestaticlast(#looprangeiter###1###))
│    %35  = LoopVectorization.check_args(xᵢⱼᶻ, drhopLp, drhopLn)::Core.Const(true)
└───        goto #15 if not %35
2 ──        goto #15 if not true
3 ── %38  = (LoopVectorization.can_turbo)(LoopVectorization.sub_fast, Val{1}())::Core.Const(true)
└───        goto #15 if not %38
4 ── %40  = (LoopVectorization.can_turbo)(LoopVectorization.mul_fast, Val{2}())::Core.Const(true)
└───        goto #15 if not %40
5 ── %42  = (LoopVectorization.can_turbo)(LoopVectorization.vfmadd_fast, Val{3}())::Core.Const(true)
└───        goto #15 if not %42
6 ── %44  = (LoopVectorization.can_turbo)(LoopVectorization.pow_fast, Val{2}())::Core.Const(true)
└───        goto #15 if not %44
7 ── %46  = (LoopVectorization.can_turbo)(LoopVectorization.sub_fast, Val{2}())::Core.Const(true)
└───        goto #15 if not %46
8 ── %48  = (LoopVectorization.can_turbo)(LoopVectorization.mul_fast, Val{2}())::Core.Const(true)
└───        goto #15 if not %48
9 ── %50  = (LoopVectorization.can_turbo)(LoopVectorization.sub_fast, Val{1}())::Core.Const(true)
└───        goto #15 if not %50
10 ─ %52  = (LoopVectorization.can_turbo)(LoopVectorization.vfmadd_fast, Val{3}())::Core.Const(true)
└───        goto #15 if not %52
11 ─ %54  = (LoopVectorization.can_turbo)(LoopVectorization.pow_fast, Val{2}())::Core.Const(true)
└───        goto #15 if not %54
12 ─ %56  = (LoopVectorization.can_turbo)(LoopVectorization.sub_fast, Val{2}())::Core.Const(true)
└───        goto #15 if not %56
13 ─ %58  = (LoopVectorization.can_turbo)(LoopVectorization.mul_fast, Val{2}())::Core.Const(true)
└───        goto #15 if not %58
14 ─        (#temp###6### = LoopVectorization.sub_fast(g))
│           (#temp###5### = LoopVectorization.mul_fast(ρ₀, #temp###6###))
│    %62  = LoopVectorization.stridedpointer_preserve(xᵢⱼᶻ)::Tuple{LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, Vector{Float64}}
│    %63  = Base.indexed_iterate(%62, 1)::Core.PartialStruct(Tuple{LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, Int64}, Any[LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, Core.Const(2)])
│           (vptr##_xᵢⱼᶻ = Core.getfield(%63, 1))
│           (@_13 = Core.getfield(%63, 2))
│    %66  = Base.indexed_iterate(%62, 2, @_13::Core.Const(2))::Core.PartialStruct(Tuple{Vector{Float64}, Int64}, Any[Vector{Float64}, Core.Const(3)])
│           (preserve#buffer#@_24 = Core.getfield(%66, 1))
│           (#loopconstnumber###12### = 1)
│    %69  = LoopVectorization.stridedpointer_preserve(drhopLp)::Tuple{LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, Vector{Float64}}
│    %70  = Base.indexed_iterate(%69, 1)::Core.PartialStruct(Tuple{LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, Int64}, Any[LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, Core.Const(2)])
│           (vptr##_drhopLp = Core.getfield(%70, 1))
│           (@_12 = Core.getfield(%70, 2))
│    %73  = Base.indexed_iterate(%69, 2, @_12::Core.Const(2))::Core.PartialStruct(Tuple{Vector{Float64}, Int64}, Any[Vector{Float64}, Core.Const(3)])
│           (preserve#buffer#@_21 = Core.getfield(%73, 1))
│    %75  = LoopVectorization.stridedpointer_preserve(drhopLn)::Tuple{LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, Vector{Float64}}
│    %76  = Base.indexed_iterate(%75, 1)::Core.PartialStruct(Tuple{LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, Int64}, Any[LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}, Core.Const(2)])
│           (vptr##_drhopLn = Core.getfield(%76, 1))
│           (@_11 = Core.getfield(%76, 2))
│    %79  = Base.indexed_iterate(%75, 2, @_11::Core.Const(2))::Core.PartialStruct(Tuple{Vector{Float64}, Int64}, Any[Vector{Float64}, Core.Const(3)])
│           (preserve#buffer#@_19 = Core.getfield(%79, 1))
│    %81  = vptr##_xᵢⱼᶻ::LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}
│    %82  = Core.tuple(#iter_loop_lower_bound###3###)::Core.Const((static(1),))
│    %83  = LoopVectorization.gespf1(%81, %82)::LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}
│    %84  = LoopVectorization.densewrapper(%83, xᵢⱼᶻ)::LayoutPointers.DensePointerWrapper{(true,), Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}, LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}}
│    %85  = vptr##_drhopLp::LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}
│    %86  = Core.tuple(#iter_loop_lower_bound###3###)::Core.Const((static(1),))
│    %87  = LoopVectorization.gespf1(%85, %86)::LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}
│    %88  = LoopVectorization.densewrapper(%87, drhopLp)::LayoutPointers.DensePointerWrapper{(true,), Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}, LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}}
│    %89  = vptr##_drhopLn::LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{1}}}
│    %90  = Core.tuple(#iter_loop_lower_bound###3###)::Core.Const((static(1),))
│    %91  = LoopVectorization.gespf1(%89, %90)::LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}
│    %92  = LoopVectorization.densewrapper(%91, drhopLn)::LayoutPointers.DensePointerWrapper{(true,), Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}, LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}}
│    %93  = Core.tuple(%84, %88, %92)::Tuple{LayoutPointers.DensePointerWrapper{(true,), Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}, LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}}, LayoutPointers.DensePointerWrapper{(true,), Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}, LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}}, LayoutPointers.DensePointerWrapper{(true,), Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}, LayoutPointers.StridedPointer{Float64, 1, 1, 0, (1,), Tuple{Static.StaticInt{8}}, Tuple{Static.StaticInt{0}}}}}
│    %94  = Main.Val::Core.Const(Val)
│    %95  = ()::Core.Const(())
│    %96  = Core.apply_type(%94, %95)::Core.Const(Val{()})
│    %97  = (%96)()::Core.Const(Val{()}())
│    %98  = LoopVectorization.grouped_strided_pointer(%93, %97)::Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1,), (1,), (1,)), ((1,), (2,), (3,)), Tuple{Static.StaticInt{8}, Static.StaticInt{8}, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{0}}}, Tuple{Nothing, Nothing, Nothing}}
│           (##grouped#strided#pointer####19### = (getfield)(%98, 1))
│    %100 = $(Expr(:gc_preserve_begin, :(preserve#buffer#@_24), :(preserve#buffer#@_21), :(preserve#buffer#@_19)))
│    %101 = LoopVectorization.zerorangestart(#looprangeiter###1###)::CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}
│    %102 = Core.tuple(%101)::Tuple{CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}}
│    %103 = Core.tuple(##grouped#strided#pointer####19###, ρ₀, #temp###5###, Cb, γ⁻¹)::Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1,), (1,), (1,)), ((1,), (2,), (3,)), Tuple{Static.StaticInt{8}, Static.StaticInt{8}, Static.StaticInt{8}}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}, Static.StaticInt{0}}}, Vararg{Float64, 4}}
│           (vargsym#233 = Core.tuple(%102, %103))
│    %105 = LoopVectorization.eltype(xᵢⱼᶻ)::Core.Const(Float64)
│    %106 = LoopVectorization.eltype(drhopLp)::Core.Const(Float64)
│    %107 = LoopVectorization.eltype(drhopLn)::Core.Const(Float64)
│           (Tloopeltype## = LoopVectorization.promote_type(%105, %106, %107))
│           (Wvecwidth## = LoopVectorization.pick_vector_width(Tloopeltype##::Core.Const(Float64)))
│    %110 = LoopVectorization._turbo_!::Core.Const(LoopVectorization._turbo_!)
│    %111 = Core.apply_type(Main.Val, (false, 0, 0, 0, false, 0xffffffffffffffff, 1, true))::Core.Const(Val{(false, 0, 0, 0, false, 0xffffffffffffffff, 1, true)})
│    %112 = (%111)()::Core.Const(Val{(false, 0, 0, 0, false, 0xffffffffffffffff, 1, true)}())
│    %113 = LoopVectorization.avx_config_val(%112, Wvecwidth##)::Core.Const(Val{(false, 0, 0, 0, false, 4, 32, 15, 64, 0x0000000000000004, 1, true)}())
│    %114 = Main.Val::Core.Const(Val)
│    %115 = Core.tuple(:LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x0000), Symbol("##DROPPED#CONSTANT##"), Symbol("##DROPPED#CONSTANT##"), LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0002, 0x0000), Symbol("##DROPPED#CONSTANT##"), Symbol("##DROPPED#CONSTANT##"), LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0003, 0x0000), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0004, 0x0000), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0005, 0x0001), :LoopVectorization, :sub_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000005, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0006, 0x0000), :LoopVectorization, :mul_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000040006, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0007, 0x0000), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0008, 0x0000), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0009, 0x0000), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 
0x00000000000000000000000000000000, 0x00000000000000000000000700080009, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000a, 0x0000), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x000b, 0x0000), :LoopVectorization, :pow_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x000000000000000000000000000a000b, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000c, 0x0000), :LoopVectorization, :sub_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x000000000000000000000000000c0009, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000d, 0x0000), :LoopVectorization, :mul_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x0000000000000000000000000001000d, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000e, 0x0000), :LoopVectorization, :sub_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000f, 0x0000), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000f00080009, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0010, 0x0000), :LoopVectorization, :pow_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x0000000000000000000000000010000b, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0011, 0x0000), :LoopVectorization, :sub_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000110009, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0012, 0x0000), :LoopVectorization, :mul_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000010012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0013, 0x0000), :LoopVectorization, :setindex!, 
LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x0000000000000000000000000000000e, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0014, 0x0002), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000013, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0015, 0x0003))::Core.Const((:LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x0000), Symbol("##DROPPED#CONSTANT##"), Symbol("##DROPPED#CONSTANT##"), LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0002, 0x0000), Symbol("##DROPPED#CONSTANT##"), Symbol("##DROPPED#CONSTANT##"), LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0003, 0x0000), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0004, 0x0000), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0005, 0x0001), :LoopVectorization, :sub_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000005, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0006, 0x0000), :LoopVectorization, :mul_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000040006, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0007, 0x0000), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 
0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0008, 0x0000), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0009, 0x0000), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000700080009, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000a, 0x0000), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x000b, 0x0000), :LoopVectorization, :pow_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x000000000000000000000000000a000b, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000c, 0x0000), :LoopVectorization, :sub_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x000000000000000000000000000c0009, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000d, 0x0000), :LoopVectorization, :mul_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x0000000000000000000000000001000d, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000e, 0x0000), :LoopVectorization, :sub_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x000f, 0x0000), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, ...

julia> @report_opt target_modules=(@__MODULE__,) loopvectorization_approach(Cb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
No errors detected

Looks good to me?

kind regards

1 Like

Use Cthulhu to descend into _turbo_!.
The problem was in promote_type on the example I looked at before, so likely there again.

It is helpful to enable optimizations and type warnings within descend.
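
A sketch of that workflow (the interactive key bindings are assumed from Cthulhu’s menu):

using Cthulhu

# Descend into the benchmarked call; inside the interactive view, toggle
# [o]ptimize and [w]arn types, then select the LoopVectorization._turbo_! call
# from the list to inspect where inference gives up.
@descend loopvectorization_approach(Cb, xᵢⱼᶻ, ρ₀, γ⁻¹, g, drhopLp, drhopLn)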

1 Like

Okay, I found the issue! It is something about the inverse of gamma in the power calculation. If I write the value directly, there is no issue anymore on Julia 1.10:

using BenchmarkTools
using StaticArrays
using LoopVectorization

# Generate the needed constants
N     = 6000
Cb    = inv(Float64(100000))
xᵢⱼᶻ  = rand(Float64,N*4)
xᵢⱼ   = rand(SVector{3,Float64},N*4)
ρ₀    = Float64(1000)
g     = Float64(9.81)
drhopLp = zeros(Float64, length(xᵢⱼ))
drhopLn = zeros(Float64, length(xᵢⱼ))

function loopvectorization_approach(Cb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
        # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
        Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ * Cb), 0.14285714285714285) - 1)  
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ * Cb), 0.14285714285714285) - 1)
        
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

l  = @benchmark loopvectorization_approach($Cb,$xᵢⱼᶻ,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn)

display(l)

julia> l  = @benchmark loopvectorization_approach($Cb,$xᵢⱼᶻ,$ρ₀,$γ⁻¹,$g,$drhopLp,$drhopLn)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  339.900 μs …  4.994 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     358.600 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   382.900 μs ± 97.110 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅█▇▅▄▃▂▂▁▁▁▁▁          ▁▁▁▁                                 ▂
  █████████████████▇▇█▇█████████▇█▇▇▇▇▇▆▄▆▅▆▆▆▅▅▆▄▆▅▁▅▅▅▄▄▆▆▇▇ █
  340 μs        Histogram: log(frequency) by time       718 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now on to seeing whether I can trick it into using the variable.

EDIT: Both ways are Float64 values, so I am confused :sweat_smile:

Kind regards

1 Like

If you’re really trying to squeeze every microsecond out of this, then it may be worth it to do

drhopLp = fill(ρ₀, length(xᵢⱼ))
drhopLn = fill(ρ₀, length(xᵢⱼ))

instead of creating them as arrays of zeroes. Then the last two lines in the loop would become

 drhopLp[iter] *= ρᵢⱼᴴ
 drhopLn[iter] *= ρⱼᵢᴴ

It is also possible to compute ρ₀*g outside the loop, and in that same line there are two multiplications by a negative, so those can probably be eliminated. That’s a couple of operations removed from the loop. Best to check that these things don’t change your results, though.
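
A hedged sketch of that last point, untested against the original results: ρ₀*g is hoisted out of the loop and the two negations cancel, so each iteration does a little less work.

using LoopVectorization

function loopvectorization_hoisted(Cb, xᵢⱼᶻ, ρ₀, g, drhopLp, drhopLn)
    ρ₀g = ρ₀ * g                      # ρ₀ * (-g) * -x  ==  (ρ₀ * g) * x
    @tturbo for iter in eachindex(xᵢⱼᶻ)
        Pᵢⱼᴴ = ρ₀g * xᵢⱼᶻ[iter]
        ρᵢⱼᴴ = ρ₀ * ((1.0 + Pᵢⱼᴴ * Cb)^0.14285714285714285 - 1.0)
        Pⱼᵢᴴ = -Pᵢⱼᴴ
        ρⱼᵢᴴ = ρ₀ * ((1.0 + Pⱼᵢᴴ * Cb)^0.14285714285714285 - 1.0)
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end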

3 Likes

Thanks for the hint, I did some of it. I decided against fill, but will perhaps reconsider.

If anyone knows how to solve the issue with ^ in LoopVectorization, see: Can this be written even faster? (CPU) - #18 by Ahmed_Salih

I would be really happy :slight_smile:

Kind regards

1 Like

I don’t know why, but this seems to work:

julia> faux(ρ₀, P, Cb, γ⁻¹) = ρ₀ * ( ^( 1 + (P * Cb), γ⁻¹) - 1)
faux (generic function with 1 method)


julia> function loopvectorization_approach3(Cb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
           @tturbo for iter in eachindex(xᵢⱼᶻ)
               # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
               Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
               Pⱼᵢᴴ  = -Pᵢⱼᴴ
               ρᵢⱼᴴ  = faux(ρ₀, Pᵢⱼᴴ, Cb, γ⁻¹)
               ρⱼᵢᴴ  = faux(ρ₀, Pⱼᵢᴴ, Cb, γ⁻¹)
               drhopLp[iter] = ρᵢⱼᴴ
               drhopLn[iter] = ρⱼᵢᴴ
           end
       end
loopvectorization_approach3 (generic function with 2 methods)


julia> l3  = @benchmark loopvectorization_approach3($Cb,$xᵢⱼᶻ,$ρ₀, $γ⁻¹, $g,$drhopLp,$drhopLn)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  35.100 μs … 557.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     73.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   73.923 μs ±  12.997 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▂                                     █▄▂▆▆▃▂▂▁             ▂
  ███▆▅▅▁▁▁▃▁▁▃▃▁▁▁▄▃▁▁▃▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▄▆██████████▇▆▇▅▆▅▆▅▁▆▆ █
  35.1 μs       Histogram: log(frequency) by time      92.5 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
1 Like