Should `literal_pow` optimize for bigger exponents?

Base.literal_pow allows x^2 and x^3 to be turned into x*x and x*x*x, which compiles to very efficient machine code. I was wondering whether it could be extended to higher exponents. The definition I used for this is:

Base.literal_pow(::typeof(^),x::Number,n::Val{N}) where N = prod(ntuple(Returns(x),Val(N)))
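
To see what this definition expands to for a small N, here is a hypothetical standalone helper (`pow4` is not part of Base, just an illustration of the same `prod`/`ntuple` pattern):

```julia
# Sketch of what the overridden literal_pow does for N = 4:
# an N-tuple of copies of x, folded with * — unrolls to ((x*x)*x)*x.
pow4(x) = prod(ntuple(Returns(x), Val(4)))

@assert pow4(2) == 16
@assert pow4(3.0) == 81.0
```

Because N is a compile-time constant (`Val{N}`), the tuple length is known to the compiler and the product unrolls completely, leaving LLVM free to rearrange the multiplies.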

I found that for exponents up to x^32, the compiled versions are faster, but the compile time gets longer and longer:


[plot: @belapsed timings for f(x) = x^n, n = 0:50]

Code used to generate the data
using BenchmarkTools, ProgressMeter

#Base.literal_pow(::typeof(^),x::Number,n::Val{N}) where N = prod(ntuple(Returns(x),Val(N)))

t1 = Float64[]
t2 = Float64[]

@showprogress for n in 0:50
	@eval f(x) = x^$n
	a = 2
	push!(t1, @elapsed f(2))
	push!(t2, @belapsed f($(Ref(a))[]))
end
My versioninfo()
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

Up to x^26, the compiled code is very efficient:

julia> @code_llvm (x->x^26)(2)
;  @ REPL[9]:1 within `#23`
; Function Attrs: uwtable
define i64 @"julia_#23_1634"(i64 signext %0) #0 {
top:
; ┌ @ C:\Users\mittel\juliastuff\literal_pow.jl:3 within `literal_pow`
; │┌ @ tuple.jl:499 within `prod`
; ││┌ @ operators.jl:655 within `*`
; │││┌ @ operators.jl:634 within `afoldl`
; ││││┌ @ int.jl:88 within `*`
       %1 = mul i64 %0, %0
       %2 = mul i64 %1, %0
       %3 = mul i64 %2, %2
       %4 = mul i64 %3, %3
       %5 = mul i64 %4, %0
       %6 = mul i64 %5, %5
; └└└└└
  ret i64 %6
}

Between x^27 and x^32 the compiled code is more complicated but apparently still OK; from x^33 on it's catastrophic.

Would it make sense to extend this? (up to 5? 10? 26?)

Also, I don’t know how it behaves on other architectures or with other types (Float64, Float32, Int32, …).

edit: corrected benchmark

This doesn’t work because it is too inaccurate. However, in 1.8, pow with integer exponents got a bunch faster (it uses compensated power by squaring).
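
For reference, plain (uncompensated) power by squaring needs only O(log n) multiplies; here is a minimal sketch (the 1.8 implementation additionally compensates the floating-point rounding error, which this does not):

```julia
# Minimal power-by-squaring sketch; assumes n >= 0, no error compensation.
function pow_by_squaring(x, n::Integer)
    r = one(x)
    while n > 0
        isodd(n) && (r *= x)  # fold in x when the current bit of n is set
        x *= x                # square for the next bit
        n >>= 1
    end
    return r
end

@assert pow_by_squaring(2, 10) == 1024
@assert pow_by_squaring(3.0, 5) == 243.0
```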


do you mean for floats? it shouldn’t be a problem for integers?

Oh, sorry. I missed that you were timing integers. This still seems really weird. Are you sure your benchmark isn’t just constant folding the whole benchmark away?

maybe…

the @code_llvm looks optimized and beautiful:

julia> @code_llvm (x->x^23)(2)
;  @ REPL[23]:1 within `#45`
; Function Attrs: uwtable
define i64 @"julia_#45_1838"(i64 signext %0) #0 {
top:
; ┌ @ C:\Users\mittel\juliastuff\literal_pow.jl:3 within `literal_pow`
; │┌ @ tuple.jl:499 within `prod`
; ││┌ @ operators.jl:655 within `*`
; │││┌ @ operators.jl:631 within `afoldl`
; ││││┌ @ int.jl:88 within `*`
       %1 = mul i64 %0, %0  ; x^2
       %2 = mul i64 %1, %1  ; x^4
       %3 = mul i64 %2, %0  ; x^5
       %4 = mul i64 %3, %3  ; x^10
       %5 = mul i64 %4, %0  ; x^11
       %6 = mul i64 %5, %0  ; x^12
       %7 = mul i64 %6, %5  ; x^23
; └└└└└
  ret i64 %7
}
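
The annotated chain can be checked by hand; each intermediate is the power noted in its comment (a quick sanity check, not generated code):

```julia
x = 3
x2  = x   * x    # x^2
x4  = x2  * x2   # x^4
x5  = x4  * x    # x^5
x10 = x5  * x5   # x^10
x11 = x10 * x    # x^11
x12 = x11 * x    # x^12
x23 = x12 * x11  # x^23, seven multiplies in total

@assert x23 == x^23
```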

It’s definitely constant folded. There’s no such thing as picosecond operations on today’s computers. There’s some discussion about this and how to avoid it in the BenchmarkTools README.


indeed

With the corrected code:
using BenchmarkTools, ProgressMeter

#Base.literal_pow(::typeof(^),x::Number,n::Val{N}) where N = prod(ntuple(Returns(x),Val(N)))

t1 = Float64[]
t2 = Float64[]

@showprogress for n in 0:50
	@eval f(x) = x^$n
	a = 2
	push!(t1, @elapsed f(2))
	push!(t2, @belapsed f($(Ref(a))[]))
end

[plots: corrected @belapsed timings, linear and log scale]


I think we might be able to get the same result (without the performance cliff) by doing more aggressive constant propagation here.

prod(ntuple(...)) is a very suboptimal way to compute large powers; you want to use repeated squaring, or more generally an optimal addition chain. This is implemented in the FastPow package, but it isn’t the default for literal_pow because it is slightly less accurate for floating-point types.
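
As a concrete example of why addition chains matter: x^15 costs six multiplies with binary square-and-multiply, but only five with the addition chain 1, 2, 3, 6, 12, 15 (a hand-worked sketch):

```julia
x = 3
x2  = x   * x    # 1+1  = 2
x3  = x2  * x    # 2+1  = 3
x6  = x3  * x3   # 3+3  = 6
x12 = x6  * x6   # 6+6  = 12
x15 = x12 * x3   # 12+3 = 15, five multiplies total

@assert x15 == x^15
```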

LLVM only does this by default for integer types; for floating-point types you have to use @fastmath because it changes (worsens) the roundoff errors. The FastPow package extends this to other types beyond the small set of built-in types supported by LLVM.


If you look at the generated code, it looks like LLVM is smart enough to figure this out by itself.

only up to 26

Benchmarking powers for integer types is pretty misleading.

In practice, you are almost always going to use floating-point types in calculations involving large exponents, to avoid overflow (unless you are doing number theory, in which case you will be using BigInt). And LLVM optimizes products of floating point numbers very differently because floating-point multiplication is not associative.
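
A quick illustration of the overflow issue: Int64 arithmetic silently wraps, while Float64 loses some precision for large values but keeps the magnitude, and BigInt is exact:

```julia
# typemax(Int64) ≈ 9.2e18, so 10^19 wraps around to a negative number
@assert 10^19 < 0

# BigInt is exact (at a much higher cost per operation)
@assert big(10)^19 == big"10000000000000000000"

# Float64 keeps the right magnitude
@assert 10.0^19 ≈ 1.0e19
```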


In particular, here is the result (on my 2016 Intel laptop) with @fastpow, benchmarking for floating-point exponentiation:

benchmark code
using BenchmarkTools, ProgressMeter, FastPow

t_old = Float64[]
t_new = Float64[]

@showprogress for n in 0:50
	@eval f(x) = x^$n
	@eval @fastpow g(x) = x^$n
	a = 2.0
	push!(t_old, @belapsed f($(Ref(a))[]))
	push!(t_new, @belapsed g($(Ref(a))[]))
end

See also https://github.com/JuliaLang/julia/pull/44106 which should allow better constant prop through ^

I found an improvement for BigInt up to x^5, for Complex{Int64} up to x^6, and none for Float64 compared to @fastmath:

[plots: benchmark results for BigInt, Complex{Int64}, and Float64]

As I said above, LLVM knows how to optimize small-integer exponents for a handful of built-in types, including floating-point types if you use @fastmath, but won’t do anything for more general types (including complex numbers).

For BigInt, in general I wouldn’t expect any improvement from unrolling, since the arithmetic is so expensive there that the underlying GMP library can already afford to do something pretty good, and the optimal algorithm is more complicated. (I can’t reproduce your improvement for small powers on my machine, but in any case even that should go away if you make your starting integer larger.)

For Complex numbers, LLVM can’t do anything, even with @fastmath, but with @fastpow it should be faster for all powers (not just “up to 6”, which is an artifact of your unrolling the power in the most inefficient way and hoping that the compiler will rearrange to compensate, which it can’t for more general number types):