I’m currently working on optimizing an algorithm to match the performance of an existing Fortran implementation. A key part of the algorithm involves computing the power of large arrays of floats raised to a scalar float.
I’ve looked into the IntelVectorMath.jl and AppleAccelerate.jl packages to optimize this. However, both packages only seem to expose element-wise power operations between two vectors of floats. To use them, I would need to allocate a vector filled with the scalar value, matching the size of my base array, which seems inefficient.
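For concreteness, the workaround would look something like the sketch below (assuming the two-vector pow exported by IntelVectorMath.jl; the fill allocation is exactly the overhead I’d like to avoid):

    using IntelVectorMath  # used via the IVM module alias

    x = rand(10^6)  # large array of Float64
    p = 2.5         # scalar exponent

    # Elementwise power via the two-vector API: the exponent must be
    # materialized as a full vector just to match the signature.
    y = IVM.pow(x, fill(p, length(x)))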
My research suggests that both IntelVectorMath and AppleAccelerate could theoretically provide functions for array-to-scalar power operations, but these are not exposed in their Julia wrappers.
Given this, I am seeking advice on the most efficient way to compute the power of large float arrays by a scalar float in Julia. Are there alternative libraries or methods within Julia’s ecosystem that I should consider? Or is there a way to extend the existing wrappers to expose the functionality I need?
Any guidance or suggestions would be greatly appreciated.
Do you mean the elementwise power, i.e. x .^ p? If p is a scalar and x is an array, then it is probably faster to compute logp = log(p) and then do exp.(logp .* x). There are SIMD-accelerated functions for this kind of thing, too. Whoops, sorry, that computes p .^ x; the identity for x .^ p is exp.(p .* log.(x)), where the log is per-element and cannot be hoisted out of the loop.
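In code, the difference between the two cases looks like this (a sketch; only the scalar-base case lets you pull the log out of the loop):

    x = rand(10^6)
    p = 2.5

    # Scalar base p: log(p) is computed once, outside the broadcast loop.
    logp = log(p)
    y_px = exp.(logp .* x)       # equals p .^ x

    # Scalar exponent p: the log is per-element, so nothing hoists.
    y_xp = exp.(p .* log.(x))    # equals x .^ p (for nonnegative x)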
Are you sure that your speed is limited by this operation alone? Usually if you have a large array then you want to do many operations per element in a single pass, in order to utilize memory efficiently. See also More dots: Why vectorized code is not as fast as it could be. From this perspective, focusing on a single “vectorized” operation in isolation is a common mistake.
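For example, here is a sketch with made-up surrounding arithmetic, just to illustrate the fusion point:

    x = rand(10^6)
    p = 2.5
    y = similar(x)

    # One fused pass over x: the power and the (hypothetical) surrounding
    # operations all happen per element, touching memory only once.
    @. y = 3 * x^p + 1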
I am very surprised that this is expected to be faster.
Shouldn’t it be trivial to SIMD x .^ p, possibly trivially enough for the compiler to do it on its own? Or at least reasonably quick with LoopVectorization.jl? Or is there further magic that somehow makes the approach using exp and log faster? Without benchmarking, I’d expect even a naively vectorized version using powers (which might even be optimized to multiplications) to be faster than computing exp and log numerically.
What am I missing?
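For concreteness, the naive version I have in mind would be something like this (an untested sketch, assuming @turbo can vectorize the ^ call through its special-function support):

    using LoopVectorization

    x = rand(10^6)
    p = 2.5
    y = similar(x)  # same length as x

    # Untested sketch: elementwise x^p with a scalar float exponent.
    function pow_turbo!(y, x, p)
        @turbo for i in eachindex(x)
            y[i] = x[i]^p
        end
        return y
    end

    pow_turbo!(y, x, p)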
Using log also has the disadvantage of breaking for negative elements of x.
Computing x ^ p for non-integer p is not trivial even without SIMD, and the algorithm is complicated enough that I don’t think the compiler can SIMD-ize it.
To be clear, however: while exp(x) is faster than x ^ p, computing exp(p * log(x)) is slower, since it requires both an exp and a log call per element. I had at first mistakenly thought the log computation could be pulled out of the loop. Also, I knew there were SIMD-optimized versions of exp and log, which is what I had in mind.
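A quick way to check the relative costs on your own machine (a sketch using BenchmarkTools.jl; exact numbers will vary with CPU and math library):

    using BenchmarkTools

    x = rand(10^6) .+ 0.5   # keep x positive so log.(x) is real
    p = 2.5

    @btime $x .^ $p                # direct power
    @btime exp.($p .* log.($x))    # two transcendental calls per element
    @btime exp.($p .* $x)          # exp alone, for reference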
However, it looks like IntelVectorMath.jl has functions pow and pow! that compute x ^ p, and similarly for AppleAccelerate.jl, so I would try those.
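And if the wrappers really don’t expose a scalar-exponent variant, one option is to call MKL’s vdPowx directly. An untested sketch, assuming MKL_jll ships libmkl_rt and the LP64 interface (32-bit MKL_INT):

    using MKL_jll  # assumption: libmkl_rt bundles the VML functions

    x = rand(10^6)

    # Untested sketch of MKL's vdPowx, r[i] = a[i]^b for a scalar b.
    # C signature: void vdPowx(const MKL_INT n, const double a[],
    #                          const double b, double r[]);
    function powx!(r::Vector{Float64}, a::Vector{Float64}, b::Float64)
        @assert length(r) == length(a)
        ccall((:vdPowx, MKL_jll.libmkl_rt), Cvoid,
              (Cint, Ptr{Float64}, Cdouble, Ptr{Float64}),
              length(a), a, b, r)
        return r
    end

    y = powx!(similar(x), x, 2.5)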
It’s no different from x .^ p where p is not an integer, which also throws an error for negative real x.
There are various tricks to speed up such computations: e.g. the fast inverse square root (see Fast inverse square root - Wikipedia) was made famous by its use in Quake III Arena and is often attributed to John Carmack; a Julia sketch of it follows below.
If you share the distribution of values in x, and what kind of p to expect and when it is known, maybe that would facilitate such approximation magic.
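The classic single-precision trick translates to Julia directly:

    # Fast inverse square root: a bit-level initial guess refined by
    # one Newton iteration; accurate to roughly 0.2% relative error.
    function fast_inv_sqrt(x::Float32)
        i = reinterpret(Int32, x)
        i = Int32(0x5f3759df) - (i >> 1)        # magic-constant guess
        y = reinterpret(Float32, i)
        return y * (1.5f0 - 0.5f0 * x * y * y)  # one Newton step
    end

    fast_inv_sqrt(4.0f0)  # ≈ 0.5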