I am writing code that makes heavy use of “^”, and profiling suggests that “^” contributes a lot of the run time.
I have done the following timing test. Replacing “^” with “*” gives a massive speed-up, which is not surprising. Interestingly, x^3 is much faster than x^4, and it seems the compiler is capable of dynamically converting “^” to “*”!
using BenchmarkTools

f1(x) = x ^ 2
f2(x) = x ^ 1.1
f3(x) = x ^ 3
f4(x) = x ^ 4
f5(x) = x ^ 9.9
f6(x) = x ^ 5
f7(x, y) = x ^ y
f8(x) = x * x * x * x * x
@btime f1(1.5)
@btime f2(1.5)
@btime f3(1.5)
@btime f4(1.5)
@btime f5(1.5)
@btime f6(1.5)
@btime f7(1.5, 3)
@btime f7(1.5, 9.9)
@btime f8(1.5)
Unfortunately, I have to go beyond “x^3”… I am wondering if there are any tips to improve the performance further. Perhaps the solution would be to convert “^” to “*” in the expression at compile time through macros?
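For context, a minimal, hypothetical sketch of what such a macro could look like, restricted to literal non-negative integer exponents (the names unroll_pow / @unroll_pow are made up here; FastPow.jl implements this idea properly, using repeated squaring to keep the multiplication count low):

# Hypothetical sketch: walk the expression and replace x^n (literal Int n > 0)
# with n-fold multiplication at parse time.
function unroll_pow(ex)
    if ex isa Expr
        if ex.head == :call && ex.args[1] == :^ && length(ex.args) == 3 &&
           ex.args[3] isa Int && ex.args[3] > 0
            base = unroll_pow(ex.args[2])
            return Expr(:call, :*, fill(base, ex.args[3])...)
        else
            return Expr(ex.head, map(unroll_pow, ex.args)...)
        end
    end
    return ex
end

macro unroll_pow(ex)
    esc(unroll_pow(ex))
end

# @unroll_pow x^4 + y^3   # lowers to x*x*x*x + y*y*y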
Note that you aren’t really benchmarking anything here. Sub-nanosecond times indicate you’re running into constant propagation; see Manual · BenchmarkTools.jl. In Julia v1.8 BenchmarkTools will be able to stop constant propagation automatically.
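In the meantime, one way to defeat constant propagation is the documented BenchmarkTools idiom of interpolating a Ref and dereferencing it, e.g. (a small sketch reusing the functions above):

using BenchmarkTools

# Interpolating a Ref and unwrapping it hides the constant from the compiler,
# so the power is actually computed at benchmark time.
@btime f4($(Ref(1.5))[])
@btime f7($(Ref(1.5))[], $(Ref(9.9))[])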
What timings do you see in your real application? How did you come to the conclusion that ^ is the bottleneck (did you profile using e.g. the Profile stdlib)?
If you’re doing lots of ^, I’d also expect ^ to take up most of the time, to be honest.
To clarify: for literal integer powers like x^7, expanding the power into multiplications (e.g. with FastPow.jl’s @fastpow) will give you better performance at the cost of slightly degraded accuracy. (For the built-in hardware floating-point types you can also simply use @fastmath.)
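For example, a small sketch of the literal-power case (function names are just for illustration, and FastPow.jl is assumed to be installed):

using FastPow  # provides @fastpow

pow7_fastpow(x) = @fastpow x^7    # expanded into multiplications at parse time
pow7_fastmath(x) = @fastmath x^7  # lets LLVM relax the computation for hardware floats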
I’m not sure what you are trying to do, but if you happen to be evaluating polynomials, consider using evalpoly (i.e. Horner’s method) to reduce the number of multiplications / powers.
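For instance, a quick sketch with made-up coefficients:

# evalpoly(x, (c0, c1, c2, ...)) computes c0 + c1*x + c2*x^2 + ... by Horner's rule,
# so no explicit ^ is needed.
poly(x) = evalpoly(x, (1.0, -2.0, 3.0, 0.5))  # 1 - 2x + 3x^2 + 0.5x^3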
Thanks for all the suggestions!
In the end I just wrote a more aggressive version of Base’s (^):
@inline function fast_pow(x, y)
    y == -1 && return inv(x)
    y == 0 && return one(x)
    y == 1 && return x
    y == 2 && return x * x
    y == 3 && return x * x * x
    y == 4 && return x * x * x * x
    y == 5 && return x * x * x * x * x
    y == 6 && return x * x * x * x * x * x
    y == 7 && return x * x * x * x * x * x * x
    y == 8 && return x * x * x * x * x * x * x * x
    y == 9 && return x * x * x * x * x * x * x * x * x
    y == 10 && return x * x * x * x * x * x * x * x * x * x
    y == 11 && return x * x * x * x * x * x * x * x * x * x * x
    y == 12 && return x * x * x * x * x * x * x * x * x * x * x * x
    ^(x, y)  # fall back to the generic power for other exponents
end
and it already gives good speedups.
Because y is only known at runtime, I cannot use FastPow.jl to statically expand it.
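If the runtime exponents happen to be integers, a possible alternative sketch uses the unexported Base.power_by_squaring (an implementation detail, not public API), which needs only O(log y) multiplications instead of a chain of y - 1:

# Hypothetical alternative: repeated squaring for integer exponents known only at runtime.
fast_pow2(x, y::Integer) = y >= 0 ? Base.power_by_squaring(x, y) : inv(Base.power_by_squaring(x, -y))
fast_pow2(x, y) = x ^ y  # non-integer exponents fall back to the usual power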
function f(x, y, z, p, q)
    np = length(p)
    nq = length(q)
    results = zeros(np * nq)
    k = 1
    for i in 1:np
        for j in 1:nq
            results[k] = x^p[i] * y^p[i] * z^q[j]
            k += 1
        end
    end
    return results
end
Oh, this can be made 3x faster trivially by computing x^p[i]*y^p[i] outside the inner loop.
If you are willing to lose a little bit of accuracy, you can get another 2x speedup by saving log(z) and writing
function f(x, y, z, p, q)
    np = length(p)
    nq = length(q)
    results = zeros(np * nq)
    k = 1
    logz = log(z)
    for i in 1:np
        xyp = x^p[i] * y^p[i]  # hoisted out of the inner loop
        for j in 1:nq
            results[k] = xyp * exp(logz * q[j])
            k += 1
        end
    end
    return results
end
This (and your other cases) uses more multiplications than necessary. For example, x^8 = ((x*x)^2)^2, which is only 3 multiplications instead of 7. You could return @fastpow x^8 here (or return @fastmath x^8 for hardware-float types), I guess.
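A minimal sketch of the x^8 case to make the multiplication count explicit (pow8 is just an illustrative name):

# Three multiplications via repeated squaring.
function pow8(x)
    x2 = x * x      # x^2
    x4 = x2 * x2    # x^4
    return x4 * x4  # x^8
end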
But I agree with @Oscar_Smith that it’s better to optimize the larger context for these calculations.