Is there a smarter way to evaluate a float power?

Hello!

Suppose I have this expression:

V = rand(10^5)
G = 1/7

@benchmark $V .^ $G
BenchmarkTools.Trial: 1871 samples with 1 evaluation.
 Range (min … max):  2.282 ms …   7.797 ms  ┊ GC (min … max): 0.00% … 65.35%
 Time  (median):     2.592 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.668 ms ± 495.967 μs  ┊ GC (mean ± σ):  1.70% ±  6.29%

       ▅  █▃▅  
  ▄▂▂▃▃█▅▄████▇▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▂▂▁▂▁▂▂▁▁▁▁▁▂ ▃
  2.28 ms         Histogram: frequency by time        4.18 ms <

 Memory estimate: 781.30 KiB, allocs estimate: 2.

Is there a smarter way to write this expression for a faster calculation on CPU?

The calculation is “fast” already, but when I have to perform this calculation ~1e6 times a second, it starts becoming slow.

Kind regards

On 1.9 this will be about 2x faster. Also, if you're OK with some inaccuracy, @fastmath should help a bit.

1 Like

Cool!

I suppose it is an internal change, so I couldn’t copy-paste a code snippet from the beta version?

Kind regards

You totally could; it's all pure Julia code.

Okay, I think I found it here:

But on reflection, perhaps not the smartest thing to do, since I don't know what to copy over :sweat_smile:

Probably will wait for 1.9 then. Thank you for the update, and great to hear it gets twice as fast!

If you're curious, the changes are in minor tweak to pow accuracy and tests by oscardssmith · Pull Request #48233 · JuliaLang/julia · GitHub
faster `_log_ext` by oscardssmith · Pull Request #44717 · JuliaLang/julia · GitHub
and Fix some pow edge cases by oscardssmith · Pull Request #46412 · JuliaLang/julia · GitHub

2 Likes

You’re basically asking for a faster implementation of f(x) = x^(1//7).

The standard follow-up questions are:

  1. what domain are you interested in, exactly? I guess it’s an interval on the real line like [0,1] or [2.3, 2.7], for example. The smaller your domain, the faster your function can be.
  2. what accuracy are you interested in, exactly? I’m guessing you’re interested in either maximum relative or maximum absolute error over your domain. The less accuracy you need the faster your function can be.
  3. Show us bits of code around where you call this function. For example, if you’re going to subtract 3 (just an example) each time after calling your function, this can be accounted for.
2 Likes

Hi again! :blush:

Basically the function I am working with is:

The interval of interest is a function of a lot of different things: resolution, pressure, etc. As a start I would like to see if there are optimizations that avoid fixing it; otherwise, [-10, 10] would be of interest to begin with.

The accuracy I'm interested in, as a start, would be two decimal places, i.e. "any-number".xx. This is not a hard requirement; if I get it as an integer it might be good enough too, not sure yet.

The full function where ^(1/7) is to be evaluated is:

drhop = ρ₀* ^(rh,γinv) - ρ₀

Where ρ₀ = 1000, γinv = 1/7 and rh can be set as 1 ± rand.
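In standard call syntax, with the stated constants plugged in, a minimal runnable sketch of that expression (the ±0.005 spread on rh is just an assumed example, not a value from the thread):

```julia
ρ₀ = 1000.0                       # reference density
γinv = 1 / 7                      # inverse polytropic exponent
rh = 1 + 0.01 * (rand() - 0.5)    # rh = 1 ± a small number (assumed spread)

drhop = ρ₀ * rh^γinv - ρ₀
```

Since rh stays close to 1, drhop stays close to 0, which is why a local approximation of the power is attractive here.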

I hope that gives a bit more clarity :slight_smile:

Kind regards

1 Like

Just to be clear, do you mean -10 \le rh \le 10, where rh is the same parameter as in the line of code you show:

drhop = ρ₀* ^(rh,γinv) - ρ₀

I need to be clear here because the process for finding the solution takes some time, so I want to get it right.

Yes, off the top of my head, that range seems to make the most sense with the simulation I am working with right now :slight_smile:

rh = 1 ± a small number

If I have -10 < rh < 10, I believe I should be safe

1 Like

If you can confirm that the small number is less than 0.5, a Taylor series should work well.
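For example, a minimal sketch of that idea: the binomial series of (1 + x)^(1/7) around x = 0, truncated at fourth order (the function name and the truncation order are my choices; the approximation is only good for small |x|):

```julia
# Truncated binomial series: (1+x)^a = 1 + a*x + a*(a-1)/2*x^2 + ...  with a = 1/7.
function pow7_taylor(rh)
    x = rh - 1             # assumes rh = 1 ± a small number
    a = 1 / 7
    c1 = a                 #  1/7
    c2 = c1 * (a - 1) / 2  # -3/49
    c3 = c2 * (a - 2) / 3  # 13/343
    c4 = c3 * (a - 3) / 4  # -65/2401
    return 1 + x * (c1 + x * (c2 + x * (c3 + x * c4)))
end

pow7_taylor(1.01)   # ≈ 1.01^(1/7)
```

The polynomial is evaluated in Horner form, so each call is a handful of multiply-adds with no transcendental functions; accuracy degrades rapidly as |x| approaches 1.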

And @nsajko, I did a small "worst case" calculation based on what I've seen:

And as c0 increases, DDTgz goes down and the small number becomes even smaller.

So I don't think it would be fair to say a range of 0.5, since I would like it to work for the more extreme stuff too :slight_smile:

But I think [-10,10] could have been [-5,5] if that helps

Kind regards

1 Like
julia> @time using LoopVectorization
  1.256829 seconds (4.11 M allocations: 265.956 MiB, 5.53% gc time, 6.74% compilation time)

julia> V = rand(10^5);

julia> G = 1/7;

julia> @btime $V .^ $G;
  1.351 ms (2 allocations: 781.30 KiB)

julia> @btime @fastmath $V .^ $G;
  1.350 ms (2 allocations: 781.30 KiB)

julia> @btime @turbo $V .^ $G;
  102.578 μs (2 allocations: 781.30 KiB)

julia> versioninfo()
Julia Version 1.10.0-DEV.410
Commit 62c1137e38 (2023-01-21 04:16 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

Note milliseconds vs microseconds.

8 Likes

Sorry, I give up, it turns out this would take too much time to implement.

The general idea with such approximation tasks is to implement so-called "range reduction" for reducing the arguments to some smaller interval, and use a kernel function on that smaller interval. The kernel function is usually a polynomial or a rational function. I'm currently developing some packages for automatically implementing polynomial-based kernel functions, so I thought I'd "just" adapt the range reduction from Julia's Base.Math and swap in a custom kernel function. However, it turns out that the range reduction in Base.Math for the exponential function (which ^ relies on) is somewhat difficult to comprehend. It even uses a big table of precomputed constants.

You could just use LoopVectorization.@turbo like Chris Elrod suggests.

1 Like

The big table of constants is just 2^(i/256) for i in 0:255. The complicated bit is that the entries are stored in slightly extended precision, by putting 8 extra bits of precision in the exponent fields of the table (since the high parts of the numbers are all between 1 and 2 and therefore have the same exponent). The reduction ends up being 2^x = 2^k * 2^(i/256) * 2^r, where k is an integer (making multiplication by 2^k an exponent shift), i/256 is an integer multiple of 1/256 whose power is looked up in the table, and 2^r is evaluated by a polynomial that needs very few terms since |r| < 1/512.
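A minimal sketch of that decomposition in plain arithmetic (not Base's bit-level implementation, and without its extended-precision table; the function name is mine):

```julia
# Split x into k + i/256 + r with |r| ≤ 1/512, so that 2^x = 2^k * 2^(i/256) * 2^r.
function exp2_reduced(x::Float64)
    N = round(Int, 256 * x)     # nearest integer multiple of 1/256
    r = x - N / 256             # remainder, |r| ≤ 1/512
    k, i = fldmod(N, 256)       # quotient (exponent shift) and table index in 0:255
    # In Base, exp2(i/256) would be a table lookup and exp2(r) a short polynomial;
    # here both are called directly, just to show the reduction reconstructs x.
    return exp2(k) * exp2(i / 256) * exp2(r)
end
```

The split k + i/256 + r is exact, so the only error in this toy version comes from the three exp2 calls and the two multiplications.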

2 Likes

For a fixed power like this you can do better than the generic power code if you work hard enough (which may or may not be worth it in your case). See, for example, how cbrt (1/3 power) is implemented: julia/cbrt.jl at master · JuliaLang/julia · GitHub

3 Likes

Where would one learn how to transform ^(1/7) into a set of machine instructions, as was done for cbrt?

I would be interested in how big a speed up this would give.

1 Like

Thank you for making me aware of this!

Currently I am using some loop structure like this:

Threads.@threads for (iter_range, ichunk) in chunks(list, nchunks)
    for iter in iter_range
        # do calculations
    end
end

And simply replacing the threaded part with @turbo or @tturbo does not work, so I suppose I have to rewrite my loops to get the benefit you showed.

The reason I iterate like this is because it is a multi-threaded for loop where each thread is working on a chunk (using ChunkSplitters) and then it loops through the calculation in the nested for loop.
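For instance, a stdlib-only sketch of that restructuring (the function name, the ±rand data, and the chunking via Iterators.partition are my own stand-ins for the ChunkSplitters setup; with LoopVectorization loaded, the @inbounds @simd inner loop is where @turbo would go):

```julia
using Base.Threads

# Threaded outer loop over chunks; a plain indexed inner loop is the shape the
# compiler (or LoopVectorization's @turbo) can vectorize.
function update_drhop!(drhop, rh; ρ₀ = 1000.0, γinv = 1 / 7, nchunks = nthreads())
    ranges = collect(Iterators.partition(eachindex(rh), cld(length(rh), nchunks)))
    @threads for iter_range in ranges
        @inbounds @simd for i in iter_range
            drhop[i] = ρ₀ * rh[i]^γinv - ρ₀
        end
    end
    return drhop
end

rh = 1 .+ 0.01 .* (rand(10_000) .- 0.5)   # rh = 1 ± a small number
drhop = update_drhop!(similar(rh), rh)
```

Each thread owns a contiguous range of indices, so there is no write contention, and the inner loop body is branch-free, which is what the vectorizer needs.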

But now I know how I can get a drastic speed-up on CPU, because the function where I have to use ^(1/7) is the heaviest part of the simulation as of now, and that specific calculation using the power is definitely the heaviest part of it (seen with @profview).

I just wanted to show you the hope and potential impact you gave me by showing this approach with @turbo. A 10x speed-up of that function is really magnificent.

I guess a first step would be to pull this calculation out of the main loop and calculate that heavy part by itself using @turbo?

Kind regards

There are a lot of techniques, but IIRC it involves understanding the floating point format, which lets you get a quick crude approximation by manipulating the exponent and significand separately, followed by Newton iterations to polish the root.
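As an illustration of that recipe (a sketch of the general idea, not the actual cbrt code), here is a seventh root assuming x is positive, finite, and normal: divide the raw exponent bits by 7 against a rebiasing constant for a crude guess, then polish with Newton steps on f(y) = y^7 - x. The names and the iteration count are my choices:

```julia
# Rebiasing constant for the exponent-bits trick: (1 - 1/7) * bias * 2^52,
# with 1023 the Float64 exponent bias.
const SEVENTH_MAGIC = round(UInt64, (1 - 1 / 7) * 1023 * 2.0^52)

function seventh_root(x::Float64)   # assumes x > 0, finite, normal
    # Crude guess: dividing the raw bits by 7 roughly divides log2(x) by 7,
    # good to a few percent.
    y = reinterpret(Float64, reinterpret(UInt64, x) ÷ 7 + SEVENTH_MAGIC)
    # Newton's method on y^7 - x = 0: y ← y - (y^7 - x) / (7y^6).
    for _ in 1:4
        y6 = y^6
        y -= (y * y6 - x) / (7 * y6)
    end
    return y
end

seventh_root(2.0)   # ≈ 2^(1/7)
```

With a guess within ~6% of the root, four Newton steps are enough to reach roughly 1e-12 relative error; a production version would also need to handle zero, negative, subnormal, and non-finite inputs.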

4 Likes
  1. In the code: Is γinv equal to 1/7? Where is it set?
    Hopefully it isn't a global variable… (and if it is, it should be a const, or typed, as 1.8 allows). But it doesn't make that much of a performance difference.

  2. The power is calculated for particle i and particle j. Assuming this goes over all (or most) particle pairs, and that when (i, j) is visited (j, i) is too, isn't there a 2x factor to be gained by calculating the power just once for each particle?

2 Likes