Exponentiation of floats

Hi, I noticed that there is a significant loss of performance between Julia 1.7.3 and 1.9.x for exponentiation of floats, which becomes roughly 5 times slower. I tested this independently on a few machines. A minimal example is attached below.

using BenchmarkTools

function test()
    x = 1.
    for i in 1:100_000_000
        x += x^2.
    end
    return x
end

@btime test()

The problem remains in the currently available Julia 1.10.x, which is in fact about 2 times slower than 1.9.x.



This is aside from the regression, but do you really need to exponentiate by a float? I see 6x faster performance with x += x^2 (without the .).
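As a rough illustration of why the integer literal is so much faster (a sketch with hypothetical helper names, not the exact lowering): `x^2` with an Int literal is lowered through Base.literal_pow into a plain multiply, while `x^2.0` has to go through the general Float64 power routine.

```julia
# Sketch: an integer-literal power lowers via Base.literal_pow (a plain
# multiply for ^2), while a float exponent takes the general pow path.
# square_int / square_float are illustrative names, not Base functions.
square_int(x) = x^2      # lowered via Base.literal_pow -> x*x
square_float(x) = x^2.0  # general Float64 ^ routine

x = 1.1
println(square_int(x) == x * x)   # literal_pow for ^2 is exactly x*x
println(square_float(x) ≈ x * x)  # general path agrees up to rounding
```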


I don't know if the OP needs it, but some fields do need float exponents.


Yes, I am aware that using an integer exponent is much faster. However, I do require exponentiating by floats in general, and this is a very significant difference between older and newer versions of Julia.


How fast was the older version? I have opened an issue: Exponential by a float is very slow · Issue #52079 · JuliaLang/julia · GitHub

That minimal example runs in ~220 ms on 1.7.3, whereas on 1.9.3 it is ~900 ms and on 1.10.x ~2100 ms.

For me it's the same speed on 1.7:

julia> @benchmark x^y setup=begin x=rand(); y=rand() end
BenchmarkTools.Trial: 10000 samples with 989 evaluations.
 Range (min … max):  46.524 ns … 126.397 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     51.524 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.200 ns ±   8.123 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █           ▁▅                        ▄▇               ▂▄    ▂
  █▃▁▇██▇█▇▇▇█████▆▅▅▆▃▄▄▁▅▁▄▃▃▄▃▄▃▁▁▃▃▁██▆▄▅▇▇▇▇▇▇▇▇▆▅▇▇██▆▆▇ █
  46.5 ns       Histogram: log(frequency) by time      69.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

For what it's worth, in this example the 2. is getting converted to 2 inside the power function.
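A quick way to observe this from the REPL (just checking the visible behavior, not the internals): an integer-valued Float64 exponent produces the same value as the corresponding Int exponent, consistent with `^` routing it to the integer power kernel.

```julia
# An integer-valued float exponent behaves like the Int exponent,
# because Float64 ^ detects it and takes the integer path internally.
x = 1.2345
a = x^2.0  # float exponent, integer-valued
b = x^2    # Int exponent
println(a == b)
```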

In the LLVM IR of test on 1.9.3 (the same thing occurs on 1.10-rc1):

%1 = call double @j_pow_body_301(double %value_phi1, i64 signext 2) #0

I see a 2x regression from 1.9.3 to 1.10-rc1. The @noinline x^yint call doesn't inline/const-prop as well as it does on 1.9.3, which could be the problem.

EDIT: I manually bypassed the call to ^ in 1.10 and inserted a pow_body call to match 1.9.3, but it didn't make a difference. Likely the regression is in pow_body, as otherwise the LLVM IR seems to be identical.


Do you happen to be running this on a fairly old CPU? For CPUs with FMA (Haswell/Bulldozer or newer), floating-point powers should have gotten a lot faster.


Note that this is a pretty bad benchmark, since 1^x is a special case, so you are benchmarking the speed of a specific case rather than floating-point exponentiation in general.

1 Like

I don't think so:

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × AMD Ryzen 7 5700U with Radeon Graphics
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 1 on 16 virtual cores

I'm fairly confused, though: as far as I can tell there's been no change to the code between 1.9.3 and 1.10-rc1, and the LLVM IR is near identical. Especially as ^2 is the same speed on both versions.

The function is adding to x, so it's only 1^2. on the first iteration.

Oh, right. It's still a bad benchmark, but for a different reason. After the first 12 iterations, it's computing Inf^2., which is also a special case.
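For reference, both bases that this loop ends up feeding into ^ are short-circuited by special cases in the Float64 power function, so neither exercises the general code path:

```julia
# Both bases the original benchmark produces hit special cases in ^:
println(1.0 ^ 123.456)  # a base of 1.0 always returns 1.0
println(Inf ^ 2.0)      # an Inf base returns Inf for a positive exponent
```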


You’re right that something is wrong with 1.10 (not sure what yet).
A better version of this benchmark would be:

function test()
    x = 1.1
    for i in 1:100_000_000
        x = x^2.0
        x > 1e160 && (x = 1.1)
    end
    return x
end
This gives me ~740 ms on 1.9 and nightly, but 1.3 s on 1.10. This is very odd, since we didn't change anything here between 1.10 and nightly as far as I know.


Yep, this is a better example, but the problem is still there. What speed do you get on 1.7.3? In my tests that one is still faster than 1.9.

I can confirm this. That said, this seems to be a regression purely for the case of raising a number to the 2.0 power. If I change the power to 2.1, I get 9.6 seconds on 1.7 vs 3 seconds on 1.9. Also, for a power of 3.0 I get 390 ms on 1.7 vs 285 ms on 1.9.

I see a regression from 1.9 to 1.10 for 2.0, for 3.0 (197 ms to 320 ms), for 4.0 (817 ms to 1.6 s), and for 4.1 (3.5 s to 4.5 s) with your example.

Yeah, the 1.9-to-1.10 issue is real (and very weird).

My profile of 1.10-rc1, unlike 1.9.3, seems to suggest that the non-FMA path of two_mul is being called.
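For context, two_mul is the error-free product transform: it splits a*b into a rounded head and an exact tail. On the FMA path this is essentially a single fma call, while the non-FMA fallback needs a multi-step Dekker-style splitting, which is why landing on that path would be slower. A sketch of the FMA variant (using a hypothetical name to avoid clashing with Base.Math.two_mul):

```julia
# Sketch of two_mul's FMA path: hi is the rounded product and lo is its
# exact rounding error, so hi + lo equals a*b exactly in real arithmetic.
function two_mul_fma(a::Float64, b::Float64)
    hi = a * b
    lo = fma(a, b, -hi)  # exact residual of the product
    return hi, lo
end

hi, lo = two_mul_fma(1.1, 2.2)
println((hi, lo))
# The pair represents the product exactly (check in BigFloat):
println(big(hi) + big(lo) == big(1.1) * big(2.2))
```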



Weirdly, the LLVM IR says we're using the FMA path:

julia> @code_llvm Base.Math.two_mul(1., 2.)
;  @ math.jl:54 within `two_mul`
; Function Attrs: uwtable
define void @julia_two_mul_451([2 x double]* noalias nocapture noundef nonnull sret([2 x double]) align 8 dereferenceable(16) %0, double %1, double %2) #0 {
;  @ math.jl:56 within `two_mul`
; ┌ @ float.jl:411 within `*`
   %3 = fmul double %1, %2
; └
;  @ math.jl:57 within `two_mul`
; ┌ @ float.jl:407 within `-`
   %4 = fneg double %3
; └
; ┌ @ floatfuncs.jl:439 within `fma`
; │┌ @ floatfuncs.jl:434 within `fma_llvm`
    %5 = call double @llvm.fma.f64(double %1, double %2, double %4)
; └└
  %newstruct.sroa.0.0..sroa_idx = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 0
  store double %3, double* %newstruct.sroa.0.0..sroa_idx, align 8
  %newstruct.sroa.2.0..sroa_idx8 = getelementptr inbounds [2 x double], [2 x double]* %0, i64 0, i64 1
  store double %5, double* %newstruct.sroa.2.0..sroa_idx8, align 8
;  @ math.jl within `two_mul`
  ret void
}