Should `literal_pow` optimize for bigger exponents?

Base.literal_pow allows x^2 and x^3 to be turned into x*x and x*x*x, which compiles to very efficient machine code. I was wondering whether it could be extended to higher exponents. The definition I used for this is:

Base.literal_pow(::typeof(^),x::Number,n::Val{N}) where N = prod(ntuple(Returns(x),Val(N)))
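
To see what this definition expands to for a small N, here is a hypothetical standalone helper (`pow4` is not part of Base, just an illustration of the same `prod`/`ntuple` pattern):

```julia
# Sketch of what the overridden literal_pow does for N = 4:
# an N-tuple of copies of x, folded with * — unrolls to ((x*x)*x)*x.
pow4(x) = prod(ntuple(Returns(x), Val(4)))

@assert pow4(2) == 16
@assert pow4(3.0) == 81.0
```

Because N is a compile-time constant (`Val{N}`), the tuple length is known to the compiler and the product unrolls completely, leaving LLVM free to rearrange the multiplies.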

I found that for exponents up to x^32, the compiled versions are faster, but the compile time gets longer and longer:


[plot: @belapsed timings for f(x) = x^n, n = 0:50]

Code used to generate the data
using BenchmarkTools, ProgressMeter

#Base.literal_pow(::typeof(^),x::Number,n::Val{N}) where N = prod(ntuple(Returns(x),Val(N)))

t1 = Float64[]
t2 = Float64[]

@showprogress for n in 0:50
	@eval f(x) = x^$n
	a = 2
	push!(t1, @elapsed f(2))
	push!(t2, @belapsed f($(Ref(a))[]))
end
My versioninfo()
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

Up to x^26, the compiled code is very efficient:

julia> @code_llvm (x->x^26)(2)
;  @ REPL[9]:1 within `#23`
; Function Attrs: uwtable
define i64 @"julia_#23_1634"(i64 signext %0) #0 {
top:
; ┌ @ C:\Users\mittel\juliastuff\literal_pow.jl:3 within `literal_pow`
; │┌ @ tuple.jl:499 within `prod`
; ││┌ @ operators.jl:655 within `*`
; │││┌ @ operators.jl:634 within `afoldl`
; ││││┌ @ int.jl:88 within `*`
       %1 = mul i64 %0, %0
       %2 = mul i64 %1, %0
       %3 = mul i64 %2, %2
       %4 = mul i64 %3, %3
       %5 = mul i64 %4, %0
       %6 = mul i64 %5, %5
; └└└└└
  ret i64 %6
}

Between x^27 and x^32 the compiled code is more complicated but apparently still OK; from x^33 on it's catastrophic.

Would it make sense to extend this? (up to 5? 10? 26?)

Also, I don’t know how it behaves on other architectures or with other types (Float64, Float32, Int32, …).

edit: corrected benchmark

This doesn’t work because it is too inaccurate. However, in 1.8, pow with integer exponents got a bunch faster (it uses compensated power by squaring).
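
For reference, plain (uncompensated) power by squaring needs only O(log n) multiplies; here is a minimal sketch (the 1.8 implementation additionally compensates the floating-point rounding error, which this does not):

```julia
# Minimal power-by-squaring sketch; assumes n >= 0, no error compensation.
function pow_by_squaring(x, n::Integer)
    r = one(x)
    while n > 0
        isodd(n) && (r *= x)  # fold in x when the current bit of n is set
        x *= x                # square for the next bit
        n >>= 1
    end
    return r
end

@assert pow_by_squaring(2, 10) == 1024
@assert pow_by_squaring(3.0, 5) == 243.0
```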


do you mean for floats? it shouldn’t be a problem for integers?

Oh, sorry. I missed that you were timing integers. This still seems really weird. Are you sure your benchmark isn’t just constant folding the whole benchmark away?

maybe…

the @code_llvm looks optimized and beautiful:

julia> @code_llvm (x->x^23)(2)
;  @ REPL[23]:1 within `#45`
; Function Attrs: uwtable
define i64 @"julia_#45_1838"(i64 signext %0) #0 {
top:
; ┌ @ C:\Users\mittel\juliastuff\literal_pow.jl:3 within `literal_pow`
; │┌ @ tuple.jl:499 within `prod`
; ││┌ @ operators.jl:655 within `*`
; │││┌ @ operators.jl:631 within `afoldl`
; ││││┌ @ int.jl:88 within `*`
       %1 = mul i64 %0, %0  ; x^2
       %2 = mul i64 %1, %1  ; x^4
       %3 = mul i64 %2, %0  ; x^5
       %4 = mul i64 %3, %3  ; x^10
       %5 = mul i64 %4, %0  ; x^11
       %6 = mul i64 %5, %0  ; x^12
       %7 = mul i64 %6, %5  ; x^23
; └└└└└
  ret i64 %7
}
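
The annotated chain can be checked by hand; each intermediate is the power noted in its comment (a quick sanity check, not generated code):

```julia
x = 3
x2  = x   * x    # x^2
x4  = x2  * x2   # x^4
x5  = x4  * x    # x^5
x10 = x5  * x5   # x^10
x11 = x10 * x    # x^11
x12 = x11 * x    # x^12
x23 = x12 * x11  # x^23, seven multiplies in total

@assert x23 == x^23
```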

It’s definitely constant folded. There’s no such thing as picosecond operations on today’s computers. There’s some discussion about this and how to avoid it in the BenchmarkTools README.


indeed

With the corrected code:
using BenchmarkTools, ProgressMeter

#Base.literal_pow(::typeof(^),x::Number,n::Val{N}) where N = prod(ntuple(Returns(x),Val(N)))

t1 = Float64[]
t2 = Float64[]

@showprogress for n in 0:50
	@eval f(x) = x^$n
	a = 2
	push!(t1, @elapsed f(2))
	push!(t2, @belapsed f($(Ref(a))[]))
end

[plots: corrected @belapsed timings, linear and log scale]


I think we might be able to get the same result (without the performance cliff) by doing more aggressive constant propagation here.

prod(ntuple(...)) is a very suboptimal way to compute large powers; you want to use repeated squaring, or more generally an optimal addition chain. This is implemented in the FastPow package, but it isn’t the default for literal_pow because it is slightly less accurate for floating-point types.
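
As a concrete example of why addition chains matter: x^15 costs six multiplies with binary square-and-multiply, but only five with the addition chain 1, 2, 3, 6, 12, 15 (a hand-worked sketch):

```julia
x = 3
x2  = x   * x    # 1+1  = 2
x3  = x2  * x    # 2+1  = 3
x6  = x3  * x3   # 3+3  = 6
x12 = x6  * x6   # 6+6  = 12
x15 = x12 * x3   # 12+3 = 15, five multiplies total

@assert x15 == x^15
```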

LLVM only does this by default for integer types; for floating-point types you have to use @fastmath because it changes (worsens) the roundoff errors. The FastPow package extends this to other types beyond the small set of built-in types supported by LLVM.


If you look at the generated code, it looks like LLVM is smart enough to figure this out by itself.

only up to 26

Benchmarking powers for integer types is pretty misleading.

In practice, you are almost always going to use floating-point types in calculations involving large exponents, to avoid overflow (unless you are doing number theory, in which case you will be using BigInt). And LLVM optimizes products of floating point numbers very differently because floating-point multiplication is not associative.
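
A quick illustration of the overflow issue: Int64 arithmetic silently wraps, while Float64 loses some precision for large values but keeps the magnitude, and BigInt is exact:

```julia
# typemax(Int64) ≈ 9.2e18, so 10^19 wraps around to a negative number
@assert 10^19 < 0

# BigInt is exact (at a much higher cost per operation)
@assert big(10)^19 == big"10000000000000000000"

# Float64 keeps the right magnitude
@assert 10.0^19 ≈ 1.0e19
```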


In particular, here is the result (on my 2016 Intel laptop) with @fastpow, benchmarking for floating-point exponentiation:

benchmark code
using BenchmarkTools, ProgressMeter, FastPow

t_old = Float64[]
t_new = Float64[]

@showprogress for n in 0:50
	@eval f(x) = x^$n
	@eval @fastpow g(x) = x^$n
	a = 2.0
	push!(t_old, @belapsed f($(Ref(a))[]))
	push!(t_new, @belapsed g($(Ref(a))[]))
end

See also https://github.com/JuliaLang/julia/pull/44106 which should allow better constant prop through ^

I found an improvement for BigInt up to x^5, for Complex{Int64} up to x^6, and none for Float64 compared to @fastmath:

[plots: benchmark results for BigInt, Complex{Int64}, and Float64]

As I said above, LLVM knows how to optimize small-integer exponents for a handful of built-in types, including floating-point types if you use @fastmath, but won’t do anything for more general types (including complex numbers).

For BigInt, in general I wouldn’t expect any improvement from unrolling, since the arithmetic is so expensive there that the underlying GMP library can already afford to do something pretty good, and the optimal algorithm is more complicated. (I can’t reproduce your improvement for small powers on my machine, but in any case even that should go away if you make your starting integer larger.)

For Complex numbers, LLVM can’t do anything, even with @fastmath, but with @fastpow it should be faster for all powers (not just “up to 6”, which is an artifact of your unrolling the power in the most inefficient way and hoping that the compiler will rearrange to compensate, which it can’t for more general number types):