Can this be written even faster? (CPU)

I think the point is that if the power is a literal constant, as opposed to a variable, the compiler can simplify the calculation a lot. (It should suffice to write 1/7 rather than 0.14285714285714285, because the compiler can evaluate the former to the latter.)

However, I expect you could do even better by writing a special function to compute the seventh root of 1+x, similar to how we have special functions for cbrt and sqrt. Look at the code in cbrt.jl for the implementation of cbrt, for inspiration.

For example, when you compute (1+x)^{1/7}, is x always relatively small, e.g. |x| < 0.1? If so, a very simple approach is a Taylor expansion:

# compute (1+x)^(1/7) approximately, in double precision, by a 
# Taylor series for small x
taylor7th(x) = evalpoly(x, (1, 1/7, -3/49, 13/343, -65/2401, 351/16807, -1989/117649, 81549/5764801, -489294/40353607, 2990130/282475249))

This matches (1+x)^(1/7) to machine precision for x=0.01, to 12 digits for x=0.1, and to 5 digits for x=0.5. However, it is about 5x faster.

If you need something accurate for larger x, look at cbrt.jl … basically you compute an approximation and then polish it with Newton steps, which converge very quickly.

@Uve thanks a ton, exactly what I needed!

@stevengj unfortunately I am not that good at this kind of bit manipulation etc. Is there a ressourcce which explains and goes through for example this cube root function?`

Kind regards

Here’s a version of the optimized algorithm for x^(1/7) It’s not perfect (I think I could make it a bit faster with some extra work), but it should work for all (finite) x and is accurate to a couple floating point epsilon:

function fancy7th(x::Float64)
    # todo tune the magic constant
    # initial guess based on fast inverse sqrt trick but adjusted to compute x^(1/8)
    t = copysign(reinterpret(Float64, 0x37f306fe0a31b715 + reinterpret(UInt64,abs(x))>>3), x)
    @fastmath for _ in 1:3
        # newton's method for t^3 - x/t^4 = 0
        t2 = t*t
        t3 = t2*t
        t4 = t2*t2
        xot4 = x/t4
        t = t - t*(t3 - xot4)/(4*t3 + 3*xot4)
    end
    t
end

It is a little slower than @stevengj’s solution, but it is more accurate and works over the full range.

julia> @btime x^(1/7) setup=x=rand()
  16.985 ns (0 allocations: 0 bytes)

julia> @btime taylor7th(x) setup=x=rand()
  2.996 ns (0 allocations: 0 bytes)

julia> @btime fancy7th(x) setup=x=rand()
  7.052 ns (0 allocations: 0 bytes)

Here is an (IMO better version). This one gets within 5*10^-11 over the range and is 5.6ns (it saves 1 iteration by making a slightly better initial guess)

function fancy7th(x::Float64)
    # todo tune the magic constant
    # initial guess based on fast inverse sqrt trick but adjusted to compute x^(1/7)
    t = copysign(reinterpret(Float64, 0x36cd000000000000 + reinterpret(UInt64,abs(x))÷7), x)
    @fastmath for _ in 1:2
        # newton's method for t^3 - x/t^4 = 0
        t2 = t*t
        t3 = t2*t
        t4 = t2*t2
        xot4 = x/t4
        t = t - t*(t3 - xot4)/(4*t3 + 3*xot4)
    end
    t
end

Thanks a ton for the help!

I just tried rolling an example with it quickly. Unfortunately, it does not seem to work directly with LoopVectorization in Julia 1.10 atleast:

using BenchmarkTools
using StaticArrays
using LoopVectorization


# READ: https://discourse.julialang.org/t/can-this-be-written-even-faster-cpu/109924/17
# Generate the needed constants
N     = 6000
Cb    = inv(Float64(100000))
invCb = inv(Cb)
xᵢⱼᶻ  = rand(Float64,N*4)
xᵢⱼ   = rand(SVector{3,Float64},N*4)
γ⁻¹   = Float64(1/7)
ρ₀    = Float64(1000)
g     = Float64(9.81)
drhopLp = zeros(Float64, length(xᵢⱼ))
drhopLn = zeros(Float64, length(xᵢⱼ))


function fancy7th(x::Float64)
    # todo tune the magic constant
    # initial guess based on fast inverse sqrt trick but adjusted to compute x^(1/7)
    t = copysign(reinterpret(Float64, 0x36cd000000000000 + reinterpret(UInt64,abs(x))÷7), x)
    @fastmath for _ in 1:2
        # newton's method for t^3 - x/t^4 = 0
        t2 = t*t
        t3 = t2*t
        t4 = t2*t2
        xot4 = x/t4
        t = t - t*(t3 - xot4)/(4*t3 + 3*xot4)
    end
    t
end

faux(ρ₀, P, Cb, γ⁻¹)  = ρ₀ * ( ^( 1 + (P * Cb), γ⁻¹)   - 1)
faux_fancy(ρ₀, P, Cb) = ρ₀ * ( fancy7th( 1 + (P * Cb)) - 1)


function loopvectorization_approach3(invCb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
        @tturbo for iter in eachindex(xᵢⱼᶻ)
               # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
               Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
               Pⱼᵢᴴ  = -Pᵢⱼᴴ
               ρᵢⱼᴴ  = faux(ρ₀, Pᵢⱼᴴ, invCb, γ⁻¹)
               ρⱼᵢᴴ  = faux(ρ₀, Pⱼᵢᴴ, invCb, γ⁻¹)
               drhopLp[iter] = ρᵢⱼᴴ
               drhopLn[iter] = ρⱼᵢᴴ
        end
    end

function fancy_loopvectorization_approach3(invCb,xᵢⱼᶻ,ρ₀,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
           # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
           Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
           Pⱼᵢᴴ  = -Pᵢⱼᴴ
           ρᵢⱼᴴ  = faux_fancy(ρ₀, Pᵢⱼᴴ, invCb)
           ρⱼᵢᴴ  = faux_fancy(ρ₀, Pⱼᵢᴴ, invCb)
           drhopLp[iter] = ρᵢⱼᴴ
           drhopLn[iter] = ρⱼᵢᴴ
    end
end



l3 = @benchmark loopvectorization_approach3($invCb,$xᵢⱼᶻ,$ρ₀, $γ⁻¹, $g,$drhopLp,$drhopLn)
f3 = @benchmark fancy_loopvectorization_approach3($invCb,$xᵢⱼᶻ,$ρ₀, $g,$drhopLp,$drhopLn)

display(l3)
display(f3)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  101.700 μs …  1.006 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     131.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   134.180 μs ± 31.032 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▁▂▂▃▂▂▂▂▃▃ ██▅▃▂▁▁▁                                      ▂
  ▅▄▇███████████████████████████▇▇▇▇▇▇█▇█▇▇▆▇▇▅▆▅▆▆▅▆▅▅▄▅▅▄▁▃▅ █
  102 μs        Histogram: log(frequency) by time       212 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  420.400 μs …  6.455 ms  ┊ GC (min … max): 0.00% … 92.58%
 Time  (median):     433.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   445.478 μs ± 76.715 μs  ┊ GC (mean ± σ):  0.13% ±  0.93%

  ▁▆█▆▄▃▂▂▁▁                                                   ▂
  █████████████████▇█▇▇▇▇▆▆▆▅▇▆▅▅▆▆▆▅▆▅▅▅▅▅▄▅▁▅▅▄▄▅▅▄▄▄▄▃▅▁▄▄▅ █
  420 μs        Histogram: log(frequency) by time       724 μs <

 Memory estimate: 5.84 KiB, allocs estimate: 79.

Because of the error, the bottom result allocates I think…

As soon as I get your fancy7th to work inside of @tturbo I should see a double speed up :slight_smile:

Kind regards

I doubt what I’m about to suggest is faster than the other things being suggested, but the following is definitely more accurate for small values of P * Cb. The original writing can suffer from catastrophic cancellation.

expm1(γ⁻¹ * log1p(P * Cb))

If you do something like Taylor series or like the fancy7th suggested above, it definitely makes sense to pull the ±1s you have into those (which would have the same accuracy benefit as expm1/log1p provide).

Thanks @mikmoore ! I’ve made sure to implement this. I’ve never seen these functions before in Julia haha, but from the ? I could see it is quite clever to e^(x-1) and then log1p since it all reduces to that expression you showed above.

It did not make code faster, perhaps a little bit slower - but good to be more precise, especially around the small numbers.

Kind regards

I get

julia> include("/home/chriselrod/Documents/progwork/julia/scratch.jl")
WARNING: using BenchmarkTools.params in module Main conflicts with an existing identifier.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.497 μs … 261.852 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.822 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.193 μs ±   6.929 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃   ▄▂▃▂▁ ▁ ▁ ▃                           ▃                 ▁
  ██▅▅▆███████████▇▆▅▅▆▆██▆█▅▆▄▄▅▃▅▄▁▅▄▄▄▃▅▅▅█▇▄▄▅▅▁▄▃▃▄▃▄▄▇▅▆ █
  16.5 μs       Histogram: log(frequency) by time      45.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.986 μs … 54.222 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.250 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.109 μs ±  2.885 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▄                     ▄▃▁                                 ▁
  ███▆▅▃▄▃▁▁▃▄▁▃▆▆▄▄▄▅▅▅▅▅█████▆▇▇▆▆▇▆▆▆▅▅▅▅▄▃▅▅▃▄▅▅▅▄▄▄▆▆▆▇▆ █
  9.99 μs      Histogram: log(frequency) by time      21.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.0
Commit 4745ef8816 (2024-02-06 13:44 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
Threads: 36 default, 0 interactive, 18 GC (on 36 virtual cores)
Script
using BenchmarkTools
using StaticArrays
using LoopVectorization


# READ: https://discourse.julialang.org/t/can-this-be-written-even-faster-cpu/109924/17
# Generate the needed constants
N     = 6000
Cb    = inv(Float64(100000))
invCb = inv(Cb)
xᵢⱼᶻ  = rand(Float64,N*4)
xᵢⱼ   = rand(SVector{3,Float64},N*4)
γ⁻¹   = Float64(1/7)
ρ₀    = Float64(1000)
g     = Float64(9.81)
drhopLp = zeros(Float64, length(xᵢⱼ))
drhopLn = zeros(Float64, length(xᵢⱼ))


@inline function fancy7th(x)
    # todo tune the magic constant
    # initial guess based on fast inverse sqrt trick but adjusted to compute x^(1/7)
    t = copysign(reinterpret(Float64, 0x36cd000000000000 + reinterpret(UInt64,abs(x))÷7), x)
    @fastmath for _ in 1:2
        # newton's method for t^3 - x/t^4 = 0
        t2 = t*t
        t3 = t2*t
        t4 = t2*t2
        xot4 = x/t4
        t = t - t*(t3 - xot4)/(4*t3 + 3*xot4)
    end
    t
end

@inline faux(ρ₀, P, Cb, γ⁻¹)  = ρ₀ * ( ^( 1 + (P * Cb), γ⁻¹)   - 1)
@inline faux_fancy(ρ₀, P, Cb) = ρ₀ * ( fancy7th( 1 + (P * Cb)) - 1)
function loopvectorization_approach3(invCb,xᵢⱼᶻ,ρ₀,γ⁻¹,g,drhopLp,drhopLn)
        @tturbo for iter in eachindex(xᵢⱼᶻ)
               # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
               Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
               Pⱼᵢᴴ  = -Pᵢⱼᴴ
               ρᵢⱼᴴ  = faux(ρ₀, Pᵢⱼᴴ, invCb, γ⁻¹)
               ρⱼᵢᴴ  = faux(ρ₀, Pⱼᵢᴴ, invCb, γ⁻¹)
               drhopLp[iter] = ρᵢⱼᴴ
               drhopLn[iter] = ρⱼᵢᴴ
        end
    end

function fancy_loopvectorization_approach3(invCb,xᵢⱼᶻ,ρ₀,g,drhopLp,drhopLn)
    @tturbo for iter in eachindex(xᵢⱼᶻ)
           # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
           Pᵢⱼᴴ  = ρ₀ * (-g) * -xᵢⱼᶻ[iter]
           Pⱼᵢᴴ  = -Pᵢⱼᴴ
           ρᵢⱼᴴ  = faux_fancy(ρ₀, Pᵢⱼᴴ, invCb)
           ρⱼᵢᴴ  = faux_fancy(ρ₀, Pⱼᵢᴴ, invCb)
           drhopLp[iter] = ρᵢⱼᴴ
           drhopLn[iter] = ρⱼᵢᴴ
    end
end

l3 = @benchmark loopvectorization_approach3($invCb,$xᵢⱼᶻ,$ρ₀, $γ⁻¹, $g,$drhopLp,$drhopLn)
f3 = @benchmark fancy_loopvectorization_approach3($invCb,$xᵢⱼᶻ,$ρ₀, $g,$drhopLp,$drhopLn)

display(l3);
display(f3);

On another server

BenchmarkTools.Trial: 318 samples with 1 evaluation.
 Range (min … max):  1.083 ms … 226.018 ms  ┊ GC (min … max):  0.00% … 90.42%
 Time  (median):     1.146 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.139 ms ±  20.050 ms  ┊ GC (mean ± σ):  59.79% ±  9.30%

  █▇▂                                                          
  ███▇▇▇▄▆▁▁▁▄▁▁▄▁▄▁▁▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▆
  1.08 ms      Histogram: log(frequency) by time      4.62 ms <

 Memory estimate: 15.38 MiB, allocs estimate: 432000.

julia> display(f3);
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   7.530 μs …  3.756 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.570 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.908 μs ± 37.459 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                              ▃▂  █▇           
  ▂▁▂▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▄▃▂▃▃▄██▇███▇▅▃▃▃▃▃▂▂ ▃
  7.53 μs         Histogram: frequency by time        12.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 95 on 64 virtual cores

the allocations don’t show up with AVX512, which is why the 7980xe didn’t have that problem.

Anyone know why we might have

%227 =  = < uncached > convert(::Type{VectorizationBase.Vec{4, Float64}},::VectorizationBase.Vec{4, Int64})::Any

? Why is it “uncached”?

Okay, so I understand that it is also hardware specific. Then unfortunately I think for now I have to stick with the raw implementation of ^(1/7).

I would suggest to work with non-dimensional variables earlier in your problem: for example with the notations from the above version, I would define

\tilde{P}=P*Cb

\tilde{\rho}=1+{\rho \over \rho_0}

\tilde{x}= Cb * \rho_0 * g * x

and work with those varibles (the physical parameters most likely will then combine into a few non-dimensional parameters earlier in your code)

Then your time critical function would read for the non-dimensional variables something like

  # IF I COMMENT '* -xᵢⱼᶻ[iter]' NO ALLOCATIONS, WHY?
        Pᵢⱼᴴ  = xᵢⱼᶻ[iter]
        ρᵢⱼᴴ  =  ^( 1 + Pᵢⱼᴴ , γ⁻¹) 
        Pⱼᵢᴴ  = -Pᵢⱼᴴ
        ρⱼᵢᴴ  = ^( 1 + Pⱼᵢᴴ , γ⁻¹) 
        
        drhopLp[iter] = ρᵢⱼᴴ
        drhopLn[iter] = ρⱼᵢᴴ
    end
end

and would also avoid catastrophic cancellations for small P, if you can continue to work with \tilde{\rho} instead of \rho.

Do you mean stick with the fancy implementation?
Using ^ resulted in allocations for me on one system.
The fancy implementation was faster on both.

Hmmm that is weird, the fancy implementation would not even run for me… and it did allocate…

Will have to recheck later then, perhaps I was working while tired :slight_smile:

Kind regards

Why is it catastrophic for small P?

Kind regards

Did you look at my code? I fixed it. It is hidden under script>.Can this be written even faster? (CPU) - #28 by Elrod

Thank you!

That did work, incredible. And thanks to @stevengj and @Oscar_Smith for producing the specific function :slight_smile:

In the past I would be at ~30ish seconds for each DDT step now I see:

16-18 s, so that is almost a 50% time saving. Nice!

Kind regards