Speed of power "^"

I am writing code that makes heavy use of “^”, and profiling suggests that “^” accounts for a lot of the run time.
I have done the following timing test. Replacing “^” with “*” gives a massive speed-up, which is not surprising. Interestingly, x^3 is much faster than x^4, so it seems the compiler is capable of converting “^” to “*” for small literal powers!

f1(x) = x ^ 2
f2(x) = x ^ 1.1
f3(x) = x ^ 3
f4(x) = x ^ 4
f5(x) = x ^ 9.9
f6(x) = x ^ 5
f7(x, y) = x ^ y
f8(x) = x * x * x * x * x
@btime f1(1.5)
@btime f2(1.5)
@btime f3(1.5)
@btime f4(1.5)
@btime f5(1.5)
@btime f6(1.5)

@btime f7(1.5, 3)
@btime f7(1.5, 9.9)
@btime f8(1.5)
  0.011 ns (0 allocations: 0 bytes)
  0.864 ns (0 allocations: 0 bytes)
  0.011 ns (0 allocations: 0 bytes)
  0.825 ns (0 allocations: 0 bytes)
  0.825 ns (0 allocations: 0 bytes)
  0.826 ns (0 allocations: 0 bytes)
  0.011 ns (0 allocations: 0 bytes)
  0.861 ns (0 allocations: 0 bytes)
  0.011 ns (0 allocations: 0 bytes)

Unfortunately, I have to go beyond “x^3”… I am wondering if there are any tips to improve the performance further. Perhaps the solution would be to convert “^” to “*” in the expression at compile time through macros?

Note that you aren’t actually benchmarking anything here: sub-nanosecond times indicate you’re running into constant propagation (see Manual · BenchmarkTools.jl). In Julia v1.8, BenchmarkTools will be able to automatically stop constant propagation.
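(One standard workaround today, straight from the BenchmarkTools manual, is to interpolate the argument through a `Ref` so the compiler cannot constant-fold it — a sketch, assuming BenchmarkTools is installed:)

```julia
using BenchmarkTools

f4(x) = x ^ 4

# Passing the literal directly lets the compiler constant-fold the call,
# which is why you see ~0.01 ns "timings":
@btime f4(1.5)

# Interpolating through a Ref hides the value from the compiler,
# so @btime measures the actual runtime cost of the power:
@btime f4($(Ref(1.5))[])
```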


Ahh I see @giordano thanks! I will check with v1.8.

Also, I found that the conversion to * is actually defined in the stdlib:

Perhaps there is a way to extend it beyond x*x*x*x in this case.


What timings do you see in your real application? How did you come to the conclusion that ^ is the bottleneck (did you profile using e.g. the Profile stdlib)?

If you’re doing lots of ^, I’d also expect ^ to take up most of the time, to be honest.


You can check out Base.power_by_squaring. Be aware that numerical accuracy is not quite as good, but that may not be a problem for you.
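For example (note that `Base.power_by_squaring` is internal, not public API, so this is just a sketch):

```julia
# power_by_squaring computes x^n with O(log2(n)) multiplications,
# e.g. x^12 takes 4-5 multiplies instead of 11.
x = 1.5
a = Base.power_by_squaring(x, 12)
b = x ^ 12

# The results agree to within floating-point rounding:
@assert isapprox(a, b; rtol = 1e-13)
```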


Julia 1.8 is much faster here. That said, powers are inherently a lot harder than multiplication.


Also take a look at GitHub - JuliaMath/FastPow.jl: optimal addition-chain exponentiation for Julia


To clarify: for literal integer powers like x^7, this will give you better performance at the cost of slightly degraded accuracy. (For the built-in hardware floating-point types you can also simply use @fastmath.)
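A minimal sketch of the `@fastmath` route for a literal power (hardware floats only, and slightly less accurate than the default `^`):

```julia
# @fastmath permits the compiler to lower the literal power to a
# short chain of multiplications instead of a libm pow call.
pow7_fast(x) = @fastmath x ^ 7
pow7(x) = x ^ 7

# The two agree to within floating-point rounding:
@assert isapprox(pow7_fast(1.5), pow7(1.5); rtol = 1e-12)
```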

I’m not sure what you are trying to do, but if you happen to be evaluating polynomials, consider using evalpoly (i.e. Horner’s method) to reduce the number of multiplications / powers.
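For instance, evaluating 1 + 3x + 5x^2 + 2x^3 at x = 2 with no explicit powers at all:

```julia
# evalpoly(x, (c0, c1, c2, ...)) evaluates c0 + c1*x + c2*x^2 + ...
# via Horner's method: one multiply and one add per coefficient.
p = evalpoly(2.0, (1.0, 3.0, 5.0, 2.0))
@assert p == 43.0   # 1 + 3*2 + 5*4 + 2*8
```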


Yeah, I used the Profile stdlib to do the profiling. There are lots of “^” inside loops, taking sums of different powers…

Thanks for all the suggestions!
In the end I just made a more aggressive version of (^) in Base:

@inline function fast_pow(x, y)
    y == -1 && return inv(x)
    y == 0 && return one(x)
    y == 1 && return x
    y == 2 && return x * x
    y == 3 && return x * x * x
    y == 4 && return x * x * x * x
    y == 5 && return x * x * x * x * x
    y == 6 && return x * x * x * x * x * x
    y == 7 && return x * x * x * x * x * x * x
    y == 8 && return x * x * x * x * x * x * x * x 
    y == 9 && return x * x * x * x * x * x * x * x * x 
    y == 10 && return x * x * x * x * x * x * x * x * x * x 
    y == 11 && return x * x * x * x * x * x * x * x * x * x * x 
    y == 12 && return x * x * x * x * x * x * x * x * x * x * x * x 
    return ^(x, y)  # fall back to Base ^ for any other exponent
end

and it already gives good speedups.
Because y is only known at runtime, I cannot use FastPow.jl to statically expand it.

Note that this will be much less accurate than ^.


Out of curiosity, where do you use these powers? In a polynomial?

ahh good point.

Not quite polynomial, something like:

function f(x, y, z, p, q)
    np = length(p)
    nq = length(q)
    results = zeros(np * nq)
    k = 1
    for i in 1:np
        for j in 1:nq
            results[k] = x^p[i] * y^p[i] * z^q[j]
            k += 1
        end
    end
    return results
end

Oh, this can be made 3x faster trivially by computing x^p[i]*y^p[i] outside the inner loop.

If you are willing to lose a little bit of accuracy, you can get another 2x speedup by saving log(z) and writing

function f(x, y, z, p, q)
    np = length(p)
    nq = length(q)
    results = zeros(np * nq)
    k = 1
    logz = log(z)
    for i in 1:np
        xyp = x^p[i] * y^p[i]
        for j in 1:nq
            results[k] = xyp * exp(logz * q[j])
            k += 1
        end
    end
    return results
end

This (like your other cases) uses more multiplications than necessary. For example, x^8 = ((x*x)^2)^2, which is only 3 multiplications instead of 7. You could return @fastpow x^8 here (or return @fastmath x^8 for hardware-float types), I guess.
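For concreteness, the 3-multiplication version of x^8 by repeated squaring:

```julia
# x^8 = ((x^2)^2)^2: square three times instead of multiplying seven times.
function pow8(x)
    x2 = x * x      # x^2
    x4 = x2 * x2    # x^4
    return x4 * x4  # x^8
end

# The results agree to within floating-point rounding:
@assert isapprox(pow8(1.5), 1.5 ^ 8; rtol = 1e-14)
```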

But I agree with @Oscar_Smith that it’s better to optimize the larger context for these calculations.
