You’re basically asking for a faster implementation of f(x) = x^(1//7).
The standard follow-up questions are:
what domain are you interested in, exactly? I guess it’s an interval on the real line like [0,1] or [2.3, 2.7], for example. The smaller your domain, the faster your function can be.
what accuracy are you interested in, exactly? I’m guessing you’re interested in either maximum relative or maximum absolute error over your domain. The less accuracy you need the faster your function can be.
Show us bits of code around where you call this function. For example, if you’re going to subtract 3 (just an example) each time after calling your function, this can be accounted for.
The interval of interest is a function of a lot of different things; resolution, pressure etc. As a start I would like to see if there was optimizations avoiding that, else as a beginning [-10,10] would be of interest.
The accuracy im interested in as a start would be to two floating points decimals i.e. “any-number”.xx - this is not a hard requirement, if I get it as an integer it might be good enough too - not sure yet.
The full function where ^(1/7) is to be evaluated is:
drhop = ρ₀* ^(rh,γinv) - ρ₀
Where ρ₀ = 1000, γinv = 1/7 and rh can be set as 1 ± rand.
Sorry, I give up, it turns out this would take too much time to implement.
The general idea with such approximation tasks is to implement so called “range reduction” for reducing the arguments to some smaller interval, and use a kernel function on that smaller interval. The kernel function is usually a polynomial or a rational function. I’m currently developing some packages for automatically implementing the polynomial-based kernel functions, so I thought I’ll “just” adapt the range reduction from Julia’s Base.Math, and swap in a custom kernel function. However, it turns out that the range-reduction in Base.Math for the exponential function (which ^ relies on) is somewhat difficult to comprehend. It even uses some big table of precomputed constants.
You could just use LoopVectorization.@turbo like Chris Elrod suggests.
the big table of constants is just 2^i/256 for I in 0:255. the complicated bit is that they are stored in slightly extended precision by putting 8 extra bits of precision in the exponent fields of the table (since the high parts of the numbers are all between 1 and 2 and therefore have the same exponent). the reduction ends up being 2^x=2^k2^i2^r where k is an integer (making multiplication by 2^k into an exponent shift) i is an integer multiple of 1/256 which is stored in a table, and a polynomial for 2^r which needs very few terms since |r|<1/512
For a fixed power like this you can do better than the generic power code if you work hard enough (which may or may not be worth it in your case). See, for example, how cbrt (1/3 power) is implemented: julia/base/special/cbrt.jl at master · JuliaLang/julia · GitHub
Currently I am using some loop structure like this:
Threads.@threads for (iter_range, ichunk) in chunks(list, nchunks)
for iter in iter_range
# do calculations
end
end
And simply replacing the threads part with @turbo or @tturbo does not work so I suppose I have to rewrite my loops to get that benefit you showed.
The reason I iterate like this is because it is a multi-threaded for loop where each thread is working on a chunk (using ChunkSplitters) and then it loops through the calculation in the nested for loop.
But now I know how I can get a drastic speed up on CPU, because that function where I have to use ^(1/7) is the heaviest part of the simulation as of now:
Where that specific calculation using that power is definitely the heaviest part (@profview):
I just wanted to show you the hope / potential impact you give me when showing this approach with @turbo. A 10x speed up of that function is really magnificent
I guess a first step would be to pull this calculation out of the main loop and calculate that heavy part for it self using @turbo?
There are a lot of techniques, but IIRC it involves understanding the floating point format, which lets you get a quick crude approximation by manipulating the exponent and significand separately, followed by Newton iterations to polish the root.
In the code: Is γinv is equal to 1/7 ? Where is it set?
Hopefully it isn’t a global variable… (and if it is, should be a const or typed as 1.8 allows). But it doesn’t make that much of a performance difference.
The power is calculated for particle i and particle j, assuming this goes over all particle pairs or mostly when (i,j) it also applies to (j,i), isn’t there a 2x factor to be gained by calculating it just once for each particle.