Can this be written even faster? (CPU)

stevengj · February 8, 2024, 9:16pm

I think the point is that if the power is a literal constant, as opposed to a variable, the compiler can simplify the calculation a lot. (It should suffice to write 1/7 rather than 0.14285714285714285, because the compiler can evaluate the former to the latter.)

However, I expect you could do even better by writing a special function to compute the seventh root of 1+x, similar to how we have special functions for cbrt and sqrt. Look at the code in cbrt.jl for the implementation of cbrt, for inspiration.

For example, when you compute (1+x)^{1/7}, is x always relatively small, e.g. |x| < 0.1? If so, a very simple approach is a Taylor expansion:

# compute (1+x)^(1/7) approximately, in double precision, by a 
# Taylor series for small x
taylor7th(x) = evalpoly(x, (1, 1/7, -3/49, 13/343, -65/2401, 351/16807, -1989/117649, 81549/5764801, -489294/40353607, 2990130/282475249))

This matches (1+x)^(1/7) to machine precision for x=0.01, to 12 digits for x=0.1, and to 5 digits for x=0.5. However, it is about 5x faster.

If you need something accurate for larger x, look at cbrt.jl … basically you compute an approximation and then polish it with Newton steps, which converge very quickly.

Ahmed_Salih · February 8, 2024, 9:49pm

@Uve thanks a ton, exactly what I needed!

@stevengj unfortunately I am not that good at this kind of bit manipulation etc. Is there a ressourcce which explains and goes through for example this cube root function?`

Kind regards

Oscar_Smith · February 8, 2024, 10:21pm

Here’s a version of the optimized algorithm for x^(1/7) It’s not perfect (I think I could make it a bit faster with some extra work), but it should work for all (finite) x and is accurate to a couple floating point epsilon:

function fancy7th(x::Float64)
    # todo tune the magic constant
    # initial guess based on fast inverse sqrt trick but adjusted to compute x^(1/8)
    t = copysign(reinterpret(Float64, 0x37f306fe0a31b715 + reinterpret(UInt64,abs(x))>>3), x)
    @fastmath for _ in 1:3
        # newton's method for t^3 - x/t^4 = 0
        t2 = t*t
        t3 = t2*t
        t4 = t2*t2
        xot4 = x/t4
        t = t - t*(t3 - xot4)/(4*t3 + 3*xot4)
    end
    t
end

It is a little slower than @stevengj’s solution, but it is more accurate and works over the full range.

julia> @btime x^(1/7) setup=x=rand()
  16.985 ns (0 allocations: 0 bytes)

julia> @btime taylor7th(x) setup=x=rand()
  2.996 ns (0 allocations: 0 bytes)

julia> @btime fancy7th(x) setup=x=rand()
  7.052 ns (0 allocations: 0 bytes)

Here is an (IMO better version). This one gets within 5*10^-11 over the range and is 5.6ns (it saves 1 iteration by making a slightly better initial guess)

function fancy7th(x::Float64)
    # todo tune the magic constant
    # initial guess based on fast inverse sqrt trick but adjusted to compute x^(1/7)
    t = copysign(reinterpret(Float64, 0x36cd000000000000 + reinterpret(UInt64,abs(x))÷7), x)
    @fastmath for _ in 1:2
        # newton's method for t^3 - x/t^4 = 0
        t2 = t*t
        t3 = t2*t
        t4 = t2*t2
        xot4 = x/t4
        t = t - t*(t3 - xot4)/(4*t3 + 3*xot4)
    end
    t
end

Ahmed_Salih · February 8, 2024, 10:52pm

Oscar_Smith:

function fancy7th(x::Float64)
    # todo tune the magic constant
    # initial guess based on fast inverse sqrt trick but adjusted to compute x^(1/7)
    t = copysign(reinterpret(Float64, 0x36cd000000000000 + reinterpret(UInt64,abs(x))÷7), x)
    @fastmath for _ in 1:2
        # newton's method for t^3 - x/t^4 = 0
        t2 = t*t
        t3 = t2*t
        t4 = t2*t2
        xot4 = x/t4
        t = t - t*(t3 - xot4)/(4*t3 + 3*xot4)
    end
    t
end

Thanks a ton for the help!

I just tried rolling an example with it quickly. Unfortunately, it does not seem to work directly with LoopVectorization in Julia 1.10 atleast:

using BenchmarkTools
using StaticArrays
using LoopVectorization


# READ: https://discourse.julialang.org/t/can-this-be-written-even-faster-cpu/109924/17
# Generate the needed constants
N     = 6000
Cb    = inv(Float64(100000))
invCb = inv(Cb)
xᵢⱼᶻ  = rand(Float64,N*4)
xᵢⱼ   = rand(SVector{3,Float64},N*4)
γ⁻¹   = Float64(1/7)
ρ₀    = Float64(1000)
g     = Float64(9.81)
drhopLp = zeros(Float64, length(xᵢⱼ))
drhopLn = zeros(Float64, length(xᵢⱼ))


function fancy7th(x::Float64)
    # todo tune the magic constant
    # initial guess based on fast inverse sqrt trick but adjusted to compute x^(1/7)
    t = copysign(reinterpret(Float64, 0x36cd000000000000 + reinterpret(UInt64,abs(x))÷7), x)
    @fastmath for _ in 1:2
        # newton's method for t^3 - x/t^4 = 0
        t2 = t*t
        t3 = t2*t
        t4 = t2*t2
        xot4 = x/t4
        t = t - t*(t3 - xot4)/(4*t3 + 3*xot4)
    end
    t
end

faux(ρ₀, P, Cb, γ⁻¹)  = ρ₀ * ( ^( 1 + (P * Cb), γ⁻¹)   - 1)
faux_fancy(ρ₀, P, Cb) = ρ₀ * ( fancy7th( 1 + (P * Cb)) - 1)


function loopvectorization_approach3(invCb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
        @tturbo for iter in eachindex(xᵢⱼᶻ)
               # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
               Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
               Pⱼᵢᴴ  = -Pᵢⱼᴴ
               ρᵢⱼᴴ  = faux(ρ₀, Pᵢⱼᴴ, invCb, γ⁻¹)
               ρⱼᵢᴴ  = faux(ρ₀, Pⱼᵢᴴ, invCb, γ⁻¹)
               drhopLp[iter] = ρᵢⱼᴴ
               drhopLn[iter] = ρⱼᵢᴴ
        end
    end

function fancy_loopvectorization_approach3(invCb,xᵢⱼᶻ,ρ₀,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
           # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
           Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
           Pⱼᵢᴴ  = -Pᵢⱼᴴ
           ρᵢⱼᴴ  = faux_fancy(ρ₀, Pᵢⱼᴴ, invCb)
           ρⱼᵢᴴ  = faux_fancy(ρ₀, Pⱼᵢᴴ, invCb)
           drhopLp[iter] = ρᵢⱼᴴ
           drhopLn[iter] = ρⱼᵢᴴ
    end
end



l3 = @benchmark loopvectorization_approach3($invCb,$xᵢⱼᶻ,$ρ₀, $γ⁻¹, $g,$drhopLp,$drhopLn)
f3 = @benchmark fancy_loopvectorization_approach3($invCb,$xᵢⱼᶻ,$ρ₀, $g,$drhopLp,$drhopLn)

display(l3)
display(f3)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  101.700 μs …  1.006 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     131.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   134.180 μs ± 31.032 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▁▂▂▃▂▂▂▂▃▃ ██▅▃▂▁▁▁                                      ▂
  ▅▄▇███████████████████████████▇▇▇▇▇▇█▇█▇▇▆▇▇▅▆▅▆▆▅▆▅▅▄▅▅▄▁▃▅ █
  102 μs        Histogram: log(frequency) by time       212 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  420.400 μs …  6.455 ms  ┊ GC (min … max): 0.00% … 92.58%
 Time  (median):     433.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   445.478 μs ± 76.715 μs  ┊ GC (mean ± σ):  0.13% ±  0.93%

  ▁▆█▆▄▃▂▂▁▁                                                   ▂
  █████████████████▇█▇▇▇▇▆▆▆▅▇▆▅▅▆▆▆▅▆▅▅▅▅▅▄▅▁▅▅▄▄▅▅▄▄▄▄▃▅▁▄▄▅ █
  420 μs        Histogram: log(frequency) by time       724 μs <

 Memory estimate: 5.84 KiB, allocs estimate: 79.

Because of the error, the bottom result allocates I think…

As soon as I get your fancy7th to work inside of @tturbo I should see a double speed up

Kind regards

mikmoore · February 8, 2024, 11:00pm

I doubt what I’m about to suggest is faster than the other things being suggested, but the following is definitely more accurate for small values of P * Cb. The original writing can suffer from catastrophic cancellation.

expm1(γ⁻¹ * log1p(P * Cb))

If you do something like Taylor series or like the fancy7th suggested above, it definitely makes sense to pull the ±1s you have into those (which would have the same accuracy benefit as expm1/log1p provide).

Ahmed_Salih · February 8, 2024, 11:10pm

Thanks @mikmoore ! I’ve made sure to implement this. I’ve never seen these functions before in Julia haha, but from the ? I could see it is quite clever to e^(x-1) and then log1p since it all reduces to that expression you showed above.

It did not make code faster, perhaps a little bit slower - but good to be more precise, especially around the small numbers.

Kind regards

Elrod · February 9, 2024, 12:14am

I get

julia> include("/home/chriselrod/Documents/progwork/julia/scratch.jl")
WARNING: using BenchmarkTools.params in module Main conflicts with an existing identifier.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.497 μs … 261.852 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.822 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.193 μs ±   6.929 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃   ▄▂▃▂▁ ▁ ▁ ▃                           ▃                 ▁
  ██▅▅▆███████████▇▆▅▅▆▆██▆█▅▆▄▄▅▃▅▄▁▅▄▄▄▃▅▅▅█▇▄▄▅▅▁▄▃▃▄▃▄▄▇▅▆ █
  16.5 μs       Histogram: log(frequency) by time      45.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.986 μs … 54.222 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.250 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.109 μs ±  2.885 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▄                     ▄▃▁                                 ▁
  ███▆▅▃▄▃▁▁▃▄▁▃▆▆▄▄▄▅▅▅▅▅█████▆▇▇▆▆▇▆▆▆▅▅▅▅▄▃▅▅▃▄▅▅▅▄▄▄▆▆▆▇▆ █
  9.99 μs      Histogram: log(frequency) by time      21.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.0
Commit 4745ef8816 (2024-02-06 13:44 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
Threads: 36 default, 0 interactive, 18 GC (on 36 virtual cores)

Script

using BenchmarkTools
using StaticArrays
using LoopVectorization


# READ: https://discourse.julialang.org/t/can-this-be-written-even-faster-cpu/109924/17
# Generate the needed constants
N     = 6000
Cb    = inv(Float64(100000))
invCb = inv(Cb)
xᵢⱼᶻ  = rand(Float64,N*4)
xᵢⱼ   = rand(SVector{3,Float64},N*4)
γ⁻¹   = Float64(1/7)
ρ₀    = Float64(1000)
g     = Float64(9.81)
drhopLp = zeros(Float64, length(xᵢⱼ))
drhopLn = zeros(Float64, length(xᵢⱼ))


@inline function fancy7th(x)
    # todo tune the magic constant
    # initial guess based on fast inverse sqrt trick but adjusted to compute x^(1/7)
    t = copysign(reinterpret(Float64, 0x36cd000000000000 + reinterpret(UInt64,abs(x))÷7), x)
    @fastmath for _ in 1:2
        # newton's method for t^3 - x/t^4 = 0
        t2 = t*t
        t3 = t2*t
        t4 = t2*t2
        xot4 = x/t4
        t = t - t*(t3 - xot4)/(4*t3 + 3*xot4)
    end
    t
end

@inline faux(ρ₀, P, Cb, γ⁻¹)  = ρ₀ * ( ^( 1 + (P * Cb), γ⁻¹)   - 1)
@inline faux_fancy(ρ₀, P, Cb) = ρ₀ * ( fancy7th( 1 + (P * Cb)) - 1)
function loopvectorization_approach3(invCb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
        @tturbo for iter in eachindex(xᵢⱼᶻ)
               # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
               Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
               Pⱼᵢᴴ  = -Pᵢⱼᴴ
               ρᵢⱼᴴ  = faux(ρ₀, Pᵢⱼᴴ, invCb, γ⁻¹)
               ρⱼᵢᴴ  = faux(ρ₀, Pⱼᵢᴴ, invCb, γ⁻¹)
               drhopLp[iter] = ρᵢⱼᴴ
               drhopLn[iter] = ρⱼᵢᴴ
        end
    end

function fancy_loopvectorization_approach3(invCb,xᵢⱼᶻ,ρ₀,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
           # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
           Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
           Pⱼᵢᴴ  = -Pᵢⱼᴴ
           ρᵢⱼᴴ  = faux_fancy(ρ₀, Pᵢⱼᴴ, invCb)
           ρⱼᵢᴴ  = faux_fancy(ρ₀, Pⱼᵢᴴ, invCb)
           drhopLp[iter] = ρᵢⱼᴴ
           drhopLn[iter] = ρⱼᵢᴴ
    end
end

l3 = @benchmark loopvectorization_approach3($invCb,$xᵢⱼᶻ,$ρ₀, $γ⁻¹, $g,$drhopLp,$drhopLn)
f3 = @benchmark fancy_loopvectorization_approach3($invCb,$xᵢⱼᶻ,$ρ₀, $g,$drhopLp,$drhopLn)

display(l3);
display(f3);

On another server

BenchmarkTools.Trial: 318 samples with 1 evaluation.
 Range (min … max):  1.083 ms … 226.018 ms  ┊ GC (min … max):  0.00% … 90.42%
 Time  (median):     1.146 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.139 ms ±  20.050 ms  ┊ GC (mean ± σ):  59.79% ±  9.30%

  █▇▂                                                          
  ███▇▇▇▄▆▁▁▁▄▁▁▄▁▄▁▁▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▆
  1.08 ms      Histogram: log(frequency) by time      4.62 ms <

 Memory estimate: 15.38 MiB, allocs estimate: 432000.

julia> display(f3);
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   7.530 μs …  3.756 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.570 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.908 μs ± 37.459 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                              ▃▂  █▇           
  ▂▁▂▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▄▃▂▃▃▄██▇███▇▅▃▃▃▃▃▂▂ ▃
  7.53 μs         Histogram: frequency by time        12.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 95 on 64 virtual cores

the allocations don’t show up with AVX512, which is why the 7980xe didn’t have that problem.

Elrod · February 9, 2024, 12:33am

Anyone know why we might have

%227 =  = < uncached > convert(::Type{VectorizationBase.Vec{4, Float64}},::VectorizationBase.Vec{4, Int64})::Any

? Why is it “uncached”?

Ahmed_Salih · February 9, 2024, 8:08am

Okay, so I understand that it is also hardware specific. Then unfortunately I think for now I have to stick with the raw implementation of ^(1/7).

JM_Beckers · February 9, 2024, 8:12am

Ahmed_Salih:

unction loopvectorization_approach(Cb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
        # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
        Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
        ρᵢⱼᴴ  = ρ₀ * ( ^( 1 + (Pᵢⱼᴴ * Cb), γ⁻¹) - 1)
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ρ₀ * ( ^( 1 + (Pⱼᵢᴴ * Cb), γ⁻¹) - 1)
        
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

I would suggest to work with non-dimensional variables earlier in your problem: for example with the notations from the above version, I would define

\tilde{P}=P*Cb

\tilde{\rho}=1+{\rho \over \rho_0}

\tilde{x}= Cb * \rho_0 * g * x

and work with those varibles (the physical parameters most likely will then combine into a few non-dimensional parameters earlier in your code)

Then your time critical function would read for the non-dimensional variables something like

  # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
        Pᵢⱼᴴ  = xᵢⱼᶻ[iter]
        ρᵢⱼᴴ  =  ^( 1 + Pᵢⱼᴴ , γ⁻¹) 
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ^( 1 + Pⱼᵢᴴ , γ⁻¹) 
        
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

and would also avoid catastrophic cancellations for small P, if you can continue to work with \tilde{\rho} instead of \rho.

Elrod · February 9, 2024, 10:59am

Do you mean stick with the fancy implementation?
Using ^ resulted in allocations for me on one system.
The fancy implementation was faster on both.

Ahmed_Salih · February 9, 2024, 12:31pm

Hmmm that is weird, the fancy implementation would not even run for me… and it did allocate…

Will have to recheck later then, perhaps I was working while tired

Kind regards

Ahmed_Salih · February 9, 2024, 12:32pm

Why is it catastrophic for small P?

Kind regards

Elrod · February 9, 2024, 12:48pm

Did you look at my code? I fixed it. It is hidden under script>.Can this be written even faster? (CPU) - #28 by Elrod

Ahmed_Salih · February 9, 2024, 2:24pm

Thank you!

That did work, incredible. And thanks to @stevengj and @Oscar_Smith for producing the specific function

In the past I would be at ~30ish seconds for each DDT step now I see:

16-18 s, so that is almost a 50% time saving. Nice!

Kind regards