Is there a smarter way to evaluate a float power?

Hello!

Suppose I have this expression:

V = rand(10^5)
G = 1/7

@benchmark $V .^ $G
BenchmarkTools.Trial: 1871 samples with 1 evaluation.
 Range (min … max):  2.282 ms …   7.797 ms  ┊ GC (min … max): 0.00% … 65.35%
 Time  (median):     2.592 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.668 ms ± 495.967 μs  ┊ GC (mean ± σ):  1.70% ±  6.29%

       ▅  █▃▅  
  ▄▂▂▃▃█▅▄████▇▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▂▂▁▂▁▂▂▁▁▁▁▁▂ ▃
  2.28 ms         Histogram: frequency by time        4.18 ms <

 Memory estimate: 781.30 KiB, allocs estimate: 2.

Is there a smarter way to write this expression for a faster calculation on CPU?

The calculation is “fast” already, but when I have to perform this calculation ~1e6 times a second, it starts becoming slow.

Kind regards

On 1.9 this will be about 2x faster. Also, if you're OK with some inaccuracy, @fastmath should help a bit.

1 Like

Cool!

I suppose it is an internal change, so I couldn’t copy-paste a code snippet from the beta version?

Kind regards

You totally could; it's all pure Julia code.

Okay, I think I found it here:

But on reflection, perhaps not the smartest thing to do, since I don't know what to copy over :sweat_smile:

Probably will wait for 1.9 then. Thank you for the update, and great to hear it gets twice as fast!

If you're curious, the changes are in minor tweak to pow accuracy and tests by oscardssmith · Pull Request #48233 · JuliaLang/julia · GitHub
faster `_log_ext` by oscardssmith · Pull Request #44717 · JuliaLang/julia · GitHub
and Fix some pow edge cases by oscardssmith · Pull Request #46412 · JuliaLang/julia · GitHub

2 Likes

You’re basically asking for a faster implementation of f(x) = x^(1//7).

The standard follow-up questions are:

  1. what domain are you interested in, exactly? I guess it’s an interval on the real line like [0,1] or [2.3, 2.7], for example. The smaller your domain, the faster your function can be.
  2. what accuracy are you interested in, exactly? I’m guessing you’re interested in either maximum relative or maximum absolute error over your domain. The less accuracy you need the faster your function can be.
  3. Show us bits of code around where you call this function. For example, if you’re going to subtract 3 (just an example) each time after calling your function, this can be accounted for.
2 Likes

Hi again! :blush:

Basically the function I am working with is:

The interval of interest is a function of a lot of different things: resolution, pressure, etc. As a start I would like to see if there are optimizations that avoid fixing it; otherwise, [-10, 10] would be of interest to begin with.

The accuracy I'm interested in, as a start, would be two decimal places, i.e. "any-number".xx. This is not a hard requirement; if I get it as an integer it might be good enough too, not sure yet.

The full function where ^(1/7) is to be evaluated is:

drhop = ρ₀* ^(rh,γinv) - ρ₀

Where ρ₀ = 1000, γinv = 1/7 and rh can be set as 1 ± rand.
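In standard call syntax, with the stated constants plugged in, a minimal runnable sketch of that expression (the ±0.005 spread on rh is just an assumed example, not a value from the thread):

```julia
ρ₀ = 1000.0                       # reference density
γinv = 1 / 7                      # inverse polytropic exponent
rh = 1 + 0.01 * (rand() - 0.5)    # rh = 1 ± a small number (assumed spread)

drhop = ρ₀ * rh^γinv - ρ₀
```

Since rh stays close to 1, drhop stays close to 0, which is why a local approximation of the power is attractive here.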

I hope that gives a bit more clarity :slight_smile:

Kind regards

1 Like

Just to be clear, do you mean -10 \le rh \le 10, where rh is the same parameter as in the line of code you show:

drhop = ρ₀* ^(rh,γinv) - ρ₀

I need to be clear here because the process for finding the solution takes some time, so I want to get it right.

Yes, off the top of my head, that range seems to make the most sense with the simulation I am working with right now :slight_smile:

rh = 1 ± a small number

If I have -10 < rh < 10, I believe I should be safe

1 Like

If you can confirm that the small number is less than 0.5, a Taylor series should work well.
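For example, a minimal sketch of that idea: the binomial series of (1 + x)^(1/7) around x = 0, truncated at fourth order (the function name and the truncation order are my choices; the approximation is only good for small |x|):

```julia
# Truncated binomial series: (1+x)^a = 1 + a*x + a*(a-1)/2*x^2 + ...  with a = 1/7.
function pow7_taylor(rh)
    x = rh - 1             # assumes rh = 1 ± a small number
    a = 1 / 7
    c1 = a                 #  1/7
    c2 = c1 * (a - 1) / 2  # -3/49
    c3 = c2 * (a - 2) / 3  # 13/343
    c4 = c3 * (a - 3) / 4  # -65/2401
    return 1 + x * (c1 + x * (c2 + x * (c3 + x * c4)))
end

pow7_taylor(1.01)   # ≈ 1.01^(1/7)
```

The polynomial is evaluated in Horner form, so each call is a handful of multiply-adds with no transcendental functions; accuracy degrades rapidly as |x| approaches 1.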

And @nsajko, I did a small "worst case" calculation based on what I've seen:

And as c0 increases, DDTgz goes down and the small number becomes even smaller.

So I don't think it would be fair to say a range of 0.5, since I would like it to work for the more extreme stuff too :slight_smile:

But I think [-10,10] could have been [-5,5] if that helps

Kind regards

1 Like
julia> @time using LoopVectorization
  1.256829 seconds (4.11 M allocations: 265.956 MiB, 5.53% gc time, 6.74% compilation time)

julia> V = rand(10^5);

julia> G = 1/7;

julia> @btime $V .^ $G;
  1.351 ms (2 allocations: 781.30 KiB)

julia> @btime @fastmath $V .^ $G;
  1.350 ms (2 allocations: 781.30 KiB)

julia> @btime @turbo $V .^ $G;
  102.578 μs (2 allocations: 781.30 KiB)

julia> versioninfo()
Julia Version 1.10.0-DEV.410
Commit 62c1137e38 (2023-01-21 04:16 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

Note milliseconds vs microseconds.

8 Likes

Sorry, I give up, it turns out this would take too much time to implement.

The general idea with such approximation tasks is to implement so-called "range reduction" for reducing the arguments to some smaller interval, and use a kernel function on that smaller interval. The kernel function is usually a polynomial or a rational function. I'm currently developing some packages for automatically implementing polynomial-based kernel functions, so I thought I'd "just" adapt the range reduction from Julia's Base.Math and swap in a custom kernel function. However, it turns out that the range reduction in Base.Math for the exponential function (which ^ relies on) is somewhat difficult to comprehend. It even uses a big table of precomputed constants.

You could just use LoopVectorization.@turbo like Chris Elrod suggests.

1 Like

The big table of constants is just 2^(i/256) for i in 0:255. The complicated bit is that the entries are stored in slightly extended precision, by putting 8 extra bits of precision in the exponent fields of the table (since the high parts of the numbers are all between 1 and 2 and therefore have the same exponent). The reduction ends up being 2^x = 2^k * 2^(i/256) * 2^r, where k is an integer (making multiplication by 2^k an exponent shift), i/256 is an integer multiple of 1/256 whose power is looked up in the table, and 2^r is evaluated by a polynomial that needs very few terms since |r| < 1/512.
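A minimal sketch of that decomposition in plain arithmetic (not Base's bit-level implementation, and without its extended-precision table; the function name is mine):

```julia
# Split x into k + i/256 + r with |r| ≤ 1/512, so that 2^x = 2^k * 2^(i/256) * 2^r.
function exp2_reduced(x::Float64)
    N = round(Int, 256 * x)     # nearest integer multiple of 1/256
    r = x - N / 256             # remainder, |r| ≤ 1/512
    k, i = fldmod(N, 256)       # quotient (exponent shift) and table index in 0:255
    # In Base, exp2(i/256) would be a table lookup and exp2(r) a short polynomial;
    # here both are called directly, just to show the reduction reconstructs x.
    return exp2(k) * exp2(i / 256) * exp2(r)
end
```

The split k + i/256 + r is exact, so the only error in this toy version comes from the three exp2 calls and the two multiplications.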

2 Likes

For a fixed power like this you can do better than the generic power code if you work hard enough (which may or may not be worth it in your case). See, for example, how cbrt (1/3 power) is implemented: julia/cbrt.jl at master · JuliaLang/julia · GitHub

3 Likes

Where would one learn how to transform ^(1/7) into a set of machine instructions, as was done for cbrt?

I would be interested in how big a speed up this would give.

1 Like

Thank you for making me aware of this!

Currently I am using some loop structure like this:

Threads.@threads for (iter_range, ichunk) in chunks(list, nchunks)
    for iter in iter_range
        # do calculations
    end
end

And simply replacing the threaded part with @turbo or @tturbo does not work, so I suppose I have to rewrite my loops to get the benefit you showed.

The reason I iterate like this is because it is a multi-threaded for loop where each thread is working on a chunk (using ChunkSplitters) and then it loops through the calculation in the nested for loop.
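For instance, a stdlib-only sketch of that restructuring (the function name, the ±rand data, and the chunking via Iterators.partition are my own stand-ins for the ChunkSplitters setup; with LoopVectorization loaded, the @inbounds @simd inner loop is where @turbo would go):

```julia
using Base.Threads

# Threaded outer loop over chunks; a plain indexed inner loop is the shape the
# compiler (or LoopVectorization's @turbo) can vectorize.
function update_drhop!(drhop, rh; ρ₀ = 1000.0, γinv = 1 / 7, nchunks = nthreads())
    ranges = collect(Iterators.partition(eachindex(rh), cld(length(rh), nchunks)))
    @threads for iter_range in ranges
        @inbounds @simd for i in iter_range
            drhop[i] = ρ₀ * rh[i]^γinv - ρ₀
        end
    end
    return drhop
end

rh = 1 .+ 0.01 .* (rand(10_000) .- 0.5)   # rh = 1 ± a small number
drhop = update_drhop!(similar(rh), rh)
```

Each thread owns a contiguous range of indices, so there is no write contention, and the inner loop body is branch-free, which is what the vectorizer needs.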

But now I know how I can get a drastic speed-up on CPU, because the function where I have to use ^(1/7) is the heaviest part of the simulation as of now, and that specific calculation using the power is definitely the heaviest part of it (seen with @profview).

I just wanted to show you the hope and potential impact you gave me by showing this approach with @turbo. A 10x speed-up of that function is really magnificent.

I guess a first step would be to pull this calculation out of the main loop and calculate that heavy part by itself using @turbo?

Kind regards

There are a lot of techniques, but IIRC it involves understanding the floating point format, which lets you get a quick crude approximation by manipulating the exponent and significand separately, followed by Newton iterations to polish the root.
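As an illustration of that recipe (a sketch of the general idea, not the actual cbrt code), here is a seventh root assuming x is positive, finite, and normal: divide the raw exponent bits by 7 against a rebiasing constant for a crude guess, then polish with Newton steps on f(y) = y^7 - x. The names and the iteration count are my choices:

```julia
# Rebiasing constant for the exponent-bits trick: (1 - 1/7) * bias * 2^52,
# with 1023 the Float64 exponent bias.
const SEVENTH_MAGIC = round(UInt64, (1 - 1 / 7) * 1023 * 2.0^52)

function seventh_root(x::Float64)   # assumes x > 0, finite, normal
    # Crude guess: dividing the raw bits by 7 roughly divides log2(x) by 7,
    # good to a few percent.
    y = reinterpret(Float64, reinterpret(UInt64, x) ÷ 7 + SEVENTH_MAGIC)
    # Newton's method on y^7 - x = 0: y ← y - (y^7 - x) / (7y^6).
    for _ in 1:4
        y6 = y^6
        y -= (y * y6 - x) / (7 * y6)
    end
    return y
end

seventh_root(2.0)   # ≈ 2^(1/7)
```

With a guess within ~6% of the root, four Newton steps are enough to reach roughly 1e-12 relative error; a production version would also need to handle zero, negative, subnormal, and non-finite inputs.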

4 Likes
  1. In the code: Is γinv equal to 1/7? Where is it set?
    Hopefully it isn't a global variable… (and if it is, it should be a const, or typed, as 1.8 allows). But it doesn't make that much of a performance difference.

  2. The power is calculated for particle i and particle j. Assuming this goes over all (or most) particle pairs, and that when (i, j) is visited (j, i) is too, isn't there a 2x factor to be gained by calculating the power just once for each particle?

2 Likes