# Slow arbitrary base exponentiation, a^b

Hey everyone, I’ve been having fun optimising my code and I’ve noticed that raising a Float64 to the power of another Float64 (e.g. `a^b`) is very slow compared to base e exponentiation using the `exp()` function, even to the point where if I use change of base and evaluate `exp(log(a)*b)`, I still get a considerable speed up, even with the overhead of evaluating `log(a)`. I have some example code and benchmarks below.

``````using BenchmarkTools

a = rand(1000)
b = rand(1000)
c = zeros(1000)

function directexp(a,b,c)
for i in 1:length(c)
c[i] = a[i]^b[i]
end
end

function cobexp(a,b,c)
for i in 1:length(c)
c[i] = exp(b[i]*log(a[i]))
end
end

function eulerexp(a,b,c)
for i in 1:length(c)
c[i] = ℯ^b[i]
# c[i] = exp(b[i])
end
end

@btime directexp(a,b,c) # 68.369 μs (0 allocations: 0 bytes)
@btime cobexp(a,b,c)    # 17.991 μs (0 allocations: 0 bytes)
@btime eulerexp(a,b,c)  #  7.197 μs (0 allocations: 0 bytes)
``````

Where the result from `directexp()` and `cobexp()` only differ in the last decimal place.

Now, I understand that obviously floating point exponentiation is an expensive operation, and it’s clear that Julia has a highly optimised implementation for base e exponentiation, but why is arbitrary base exponentiation so much slower?

1 Like

On my computer `directexp` is about 25% faster, not 3x slower. (Julia 1.1, intel mac). What’s your setup?

Edit: some numbers, what does seem odd is that `cobexp` is slower than the sum of log and exp alone. Some kind of SIMD magic?

``````julia> @btime directexp(a,b,c); # all with @inbounds
15.272 μs (0 allocations: 0 bytes)

julia> @btime cobexp(a,b,c);
19.130 μs (0 allocations: 0 bytes)

julia> @btime eulerexp(a,b,c);
6.841 μs (0 allocations: 0 bytes)

julia> @btime baseexp(a,b,c); # literally exp
6.932 μs (0 allocations: 0 bytes)

julia> @btime baselog(a,b,c); # log alone
5.293 μs (0 allocations: 0 bytes)

julia> function cobexp4(a,b,c)
for i in 1:length(c)
@inbounds c[i] = log(a[i])
end
for i in 1:length(c)
@inbounds c[i] = exp(b[i] * c[i])
end
c
end
cobexp4 (generic function with 1 method)

julia> @btime cobexp4(a,b,c);
12.375 μs (0 allocations: 0 bytes)
``````

my timings:

``````directexp: 14.412 μs
cobexp: 19.790 μs
eulerexp: 6.654 μs
``````

intel 7700k @4.6ghz

That’s really weird. I get I also get

``````directexp: 71.476 μs
cobexp: 23.339 μs
eulerexp: 7.749 μs
``````

on my home computer. Both are running Julia 1.1.1 from Atom/Juno, on windows, one with an intel 7700HQ, the other with an intel 8550U. Maybe it’s a windows issue?

EDIT:
Evaluating on JuliaBox gives me

``````# 0.6.4 Kernel, using the c[i] = exp(b[i]) line in eulerexp
directexp:  52.800 μs (0 allocations: 0 bytes)
cobexp:     29.800 μs (0 allocations: 0 bytes)
eulerexp:   10.700 μs (0 allocations: 0 bytes)

# 1.0.4 Kernel
directexp:  58.400 μs (0 allocations: 0 bytes)
cobexp:     23.100 μs (0 allocations: 0 bytes)
eulerexp:   10.000 μs (0 allocations: 0 bytes)
``````

The trends also agree when using @threads on both my computers, @fastmath, in my actual code with more complicated expressions, and scales linearly with `length(c)`. I even used lengths big enough to time with a stopwatch and can confirm that the timings are accurate and representative.

Linux machine:

``````julia> @btime directexp(a,b,c)
55.856 μs (0 allocations: 0 bytes)

julia> @btime cobexp(a,b,c)
24.498 μs (0 allocations: 0 bytes)

julia> @btime eulerexp(a,b,c)
8.914 μs (0 allocations: 0 bytes)
``````

Windows machine:

``````julia> @btime directexp(a,b,c)
72.101 μs (0 allocations: 0 bytes)

julia> @btime cobexp(a,b,c)
23.900 μs (0 allocations: 0 bytes)

julia> @btime eulerexp(a,b,c)
8.267 μs (0 allocations: 0 bytes)
``````

(Julia 1.1)

MacOS:

``````jl> @btime directexp(\$a,\$b,\$c)
24.212 μs (0 allocations: 0 bytes)

jl> @btime cobexp(\$a,\$b,\$c)
28.042 μs (0 allocations: 0 bytes)

jl> @btime eulerexp(\$a,\$b,\$c)
9.911 μs (0 allocations: 0 bytes)
``````

It looks like Julia just calls `llvm.pow.f64` for `^`, so maybe it depends on how llvm was compiled.

It’s fast for me (Julia and LLVM compiled from source with -march=native):

``````julia> @btime directexp(\$a,\$b,\$c)
13.692 μs (0 allocations: 0 bytes)

julia> @btime cobexp(\$a,\$b,\$c)
20.767 μs (0 allocations: 0 bytes)

julia> @btime eulerexp(\$a,\$b,\$c)
5.445 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.3.0-DEV.395
Commit b7774095f6 (2019-06-11 21:06 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.0 (ORCJIT, skylake)

julia> @code_native directexp(a,b,c)
.text
; ┌ @ REPL:2 within `directexp'
pushq   %rbp
pushq   %r15
pushq   %r14
pushq   %r13
pushq   %r12
pushq   %rbx
pushq   %rax
movq    %rsi, (%rsp)
movq    16(%rsi), %r14
; │┌ @ array.jl:200 within `length'
movq    8(%r14), %r15
; │└
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `UnitRange'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
testq   %r15, %r15
; │└└└└└
jle     L85
movq    (%rsi), %r12
movq    8(%rsi), %r13
xorl    %ebx, %ebx
movabsq \$pow, %rbp
nop
; │ @ REPL:3 within `directexp'
; │┌ @ array.jl:728 within `getindex'
L48:
movq    (%r12), %rax
vmovsd  (%rax,%rbx,8), %xmm0    # xmm0 = mem,zero
movq    (%r13), %rax
vmovsd  (%rax,%rbx,8), %xmm1    # xmm1 = mem,zero
; │└
; │┌ @ math.jl:793 within `^'
callq   *%rbp
; │└
; │┌ @ array.jl:766 within `setindex!'
movq    (%r14), %rax
vmovsd  %xmm0, (%rax,%rbx,8)
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
cmpq    %rbx, %r15
; │└└
jne     L48
L85:
movabsq \$jl_system_image_data, %rax
; │ @ REPL:3 within `directexp'
popq    %rbx
popq    %r12
popq    %r13
popq    %r14
popq    %r15
popq    %rbp
retq
nop
; └

``````
3 Likes

Yeah, I saw that it was using a more or less direct LLVM call, but didn’t really know enough about it to go any further. It says I’m using libLLVM-6.0.1 (ORCJIT, skylake), so maybe it’s just the LLVM version difference?

EDIT:
I also noticed that `log(x)` and `exp(x)` are implemented with their own specialised algorithms in julia, rather than directly calling an LLVM function, which may explain why `cobexp()` and `eulerexp()` have been so consistent between people.

BTW, if you don’t need perfect accuracy, your function will run in 3.3μs and 6.1μs for Float32 and Float64 respectively using `fastpow` below. They’re pure Julia, so they should be fast regardless of LLVM.

``````# https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastlog.h
function fastlog2(x::Float32)::Float32
y = Float32(reinterpret(Int32, x))
y *= 1.1920928955078125f-7
y - 126.94269504f0
end
function fastlog2(x::Float64)::Float32
fastlog2(Float32(x))
end

# https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastexp.h
function fastpow2(x::Float32)::Float32
clipp = x < -126.0f0 ? -126.0f0 : x
clipp = min(126f0, max(-126f0, x))
reinterpret(Float32, UInt32((1 << 23) * (clipp + 126.94269504f0)))
end
function fastpow2(x::Float64)::Float32
fastpow2(Float32(x))
end

# https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastpow.h
function fastpow(x::Real, y::Real)::Real
fastpow2(y * fastlog2(x))
end
``````
3 Likes

In economics it’s common for a large fraction of runtime to be spent calling the pow function (Ex: CRRA utility) so it would be great if there was a consistent way for it to be fast. Is compiling from source the way to go? Would a native sysimg suffice?

FWIW My Julia is compiled from github source

Thanks Rob, that’s been really helpful. I don’t have admin privileges on my work computer, so I doubt it’s gonna be convenient to install all the dependencies required to build Julia from source here, but I’ll give it a try on my home computer and report back if it makes a difference.

In practice the fastpow function is just on the border of being accurate enough, so I’ll have a play around with that (maybe compromise by using `exp2(y*fastlog2(x))` instead of `fastpow2(y*fastlog2(x))`, which seems to be good enough for my purposes). In particular, I’d like my code to be reasonably portable, so the version/compilation dependence of the `llvm.pow.f64` isn’t great for that, even if I get it running faster on my machine.

It does seem strange that such a fundamental operation would be so variable between installations, as I’d imagine quite a lot of codes would rely on performant exponentiation, which seems to be varying by a factor of ~5 between people.

In case you didn’t notice already, the references I noted in my prior post have two versions of the approximations, e.g., `fastpow` and `fasterpow`. Despite the naming in my Julia translation, I implemented the faster approximations, so if you’d like a little more accuracy, you could try the original fast versions.

1 Like

I am now working on a GE model with a lot of `^` in the equations (CES aggregator, production functions, CRRA utility; solution method is value iteration [because of finite lifetimes], then rootfinding for market clearing), so I was curious about this: according to profiling, apparently in my code about 9-12% is spent on `^`. The code is otherwise heavily optimized.

It is of course always great to have things run faster, but I am not sure this is really an outstanding bottleneck. YMMV.

1 Like

I’m curious what do you get on the benchmark in the original post? For me `directexp` was several times slower than the others, but other people seem to have better optimization.

For a starker example, if I use the code available here https://github.com/jesusfv/Comparison-Programming-Languages-Economics/blob/master/RBC_Julia.jl but change the utility from log to CRRA, my timing goes from 1.3s to 5.5s. The number of iterations is about the same.

In a more complex model like yours it wouldn’t be as much of a bottleneck.

You’re not allowed to do `git clone ...` into your home folder? Julia downloads all depencies you need while compiling, AFAIK the only thing you really need is a compiler. If you’re working on a Windows, I don’t know how that works out

1 Like

I am getting

``````  20.319 μs (0 allocations: 0 bytes)
26.976 μs (0 allocations: 0 bytes)
8.874 μs (0 allocations: 0 bytes)
``````

on a pre-1.2.

But I am not convinced that `eulerexp` is comparable to the other two, it is not using the `a`. Correcting that,

``````julia> @btime eulerexp(a,b,c)
27.095 μs (0 allocations: 0 bytes)
``````

so after much ado, both options would make my code slower.

1 Like

Other options…

``````julia> @noinline ThrowLengthMismatchError() = throw("Lengths aren't equal!")
ThrowLengthMismatchError (generic function with 1 method)

julia> function directexp!(c,a,b)
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@inbounds for n in 1:N
c[n] = a[n]^b[n]
end
end
directexp! (generic function with 1 method)

julia> function cobexp!(c,a,b)
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@inbounds for n in 1:N
c[n] = exp(b[n]*log(a[n]))
end
end
cobexp! (generic function with 1 method)

julia> syspower(a, b) = ccall(:pow,Float64,(Float64,Float64),a,b)
syspower (generic function with 1 method)

julia> function syspow!(c,a,b)
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@inbounds for n in 1:N
c[n] = syspower(a[n], b[n])
end
end
syspow! (generic function with 1 method)

julia> using SIMDPirates, LoopVectorization
[ Info: Recompiling stale cache file /home/chriselrod/.julia/compiled/v1.3/LoopVectorization/4TogI.ji for LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> using SLEEFPirates, SLEEFwrap, xsimdwrap
[ Info: Recompiling stale cache file /home/chriselrod/.julia/compiled/v1.3/SLEEFwrap/GD9JX.ji for SLEEFwrap [6e748a96-c50b-11e8-2c8a-ed394f64ea02]
[ Info: Recompiling stale cache file /home/chriselrod/.julia/compiled/v1.3/xsimdwrap/NPDTn.ji for xsimdwrap [b9b7a223-629f-5757-a0fe-5fc90c39777c]

julia> # requires unregistered dependencies:
# https://github.com/chriselrod/VectorizationBase.jl#master
# https://github.com/chriselrod/SIMDPirates.jl#master
# https://github.com/chriselrod/LoopVectorization.jl#master
# https://github.com/chriselrod/SLEEFPirates.jl#master
# https://github.com/chriselrod/SLEEFwrap.jl#master
# https://github.com/chriselrod/xsimdwrap.jl#master

@generated function jsleefpow!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T}
quote
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@vectorize \$T for n in 1:N
c[n] = a[n] ^ b[n]
end
end
end
jsleefpow! (generic function with 1 method)

julia> @generated function csleefpow!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Union{Float32,Float64}}
quote
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@cvectorize \$T for n in 1:N
c[n] = a[n] ^ b[n]
end
end
end
csleefpow! (generic function with 1 method)

julia> @generated function xsimdpow!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Union{Float32,Float64}}
quote
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@xvectorize \$T for n in 1:N
c[n] = a[n] ^ b[n]
end
end
end
xsimdpow! (generic function with 1 method)

julia> @generated function jsleefpowcob!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T}
quote
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@vectorize \$T for n in 1:N
c[n] = exp(b[n] * log(a[n]))
end
end
end
jsleefpowcob! (generic function with 1 method)

julia> @generated function csleefpowcob!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Union{Float32,Float64}}
quote
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@cvectorize \$T for n in 1:N
c[n] = exp(b[n] * log(a[n]))
end
end
end
csleefpowcob! (generic function with 1 method)

julia> @generated function xsimdpowcob!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Union{Float32,Float64}}
quote
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@xvectorize \$T for n in 1:N
c[n] = exp(b[n] * log(a[n]))
end
end
end
xsimdpowcob! (generic function with 1 method)

julia> a = rand(1000);

julia> b = rand(1000);

julia> c = zeros(1000);

julia> using BenchmarkTools

julia> @benchmark directexp!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     14.081 μs (0.00% GC)
median time:      14.204 μs (0.00% GC)
mean time:        14.240 μs (0.00% GC)
maximum time:     24.483 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark cobexp!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     19.695 μs (0.00% GC)
median time:      19.928 μs (0.00% GC)
mean time:        19.970 μs (0.00% GC)
maximum time:     30.733 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark syspow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     12.908 μs (0.00% GC)
median time:      12.998 μs (0.00% GC)
mean time:        13.031 μs (0.00% GC)
maximum time:     23.834 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark jsleefpow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     6.537 μs (0.00% GC)
median time:      6.562 μs (0.00% GC)
mean time:        6.633 μs (0.00% GC)
maximum time:     11.406 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     5

julia> @benchmark csleefpow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     6.052 μs (0.00% GC)
median time:      6.094 μs (0.00% GC)
mean time:        6.112 μs (0.00% GC)
maximum time:     9.819 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     5

julia> @benchmark xsimdpow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     4.179 μs (0.00% GC)
median time:      4.191 μs (0.00% GC)
mean time:        4.202 μs (0.00% GC)
maximum time:     6.713 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     7

julia> @benchmark jsleefpowcob!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.442 μs (0.00% GC)
median time:      2.480 μs (0.00% GC)
mean time:        2.481 μs (0.00% GC)
maximum time:     22.302 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     9

julia> @benchmark csleefpowcob!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.413 μs (0.00% GC)
median time:      2.422 μs (0.00% GC)
mean time:        2.435 μs (0.00% GC)
maximum time:     4.540 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     9

julia> @benchmark xsimdpowcob!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.735 μs (0.00% GC)
median time:      3.746 μs (0.00% GC)
mean time:        3.755 μs (0.00% GC)
maximum time:     5.963 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     8

``````

All times above were in microseconds.
My best times were through `exp(b * log(a))` with SLEEF, and my worst through `Base.exp(b * Base.exp(a))`. My `Base.:^` was faster. I also compiled Julia from source, with `MARCH=native` in my Make.user.

The vectorized functions were naturally the fastest. The clear winner, if you can stomach the loss of precision, is @robsmith11’s version (which vectorizes after replacing a `UInt32(...)` with `unsafe_trunc(UInt32, ...)`.

``````julia> # https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastlog.h
function fastlog2(x::Float32)::Float32
y = Float32(reinterpret(Int32, x))
y *= 1.1920928955078125f-7
y - 126.94269504f0
end
fastlog2 (generic function with 1 method)

julia> function fastlog2(x::Float64)::Float32
fastlog2(Float32(x))
end
fastlog2 (generic function with 2 methods)

julia> # https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastexp.h
function fastpow2(x::Float32)::Float32
clipp = x < -126.0f0 ? -126.0f0 : x
clipp = min(126f0, max(-126f0, x))
reinterpret(Float32, Base.unsafe_trunc(UInt32, (1 << 23) * (clipp + 126.94269504f0)))
end
fastpow2 (generic function with 1 method)

julia> function fastpow2(x::Float64)::Float32
fastpow2(Float32(x))
end
fastpow2 (generic function with 2 methods)

julia> # https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastpow.h
function fastpow(x::Real, y::Real)::Real
fastpow2(y * fastlog2(x))
end
fastpow (generic function with 1 method)

julia> function fastpow!(c,a,b)
N = length(a)
N == length(b) == length(c) || ThrowLengthMismatchError()
@inbounds @simd ivdep for n in 1:N
c[n] = fastpow(a[n], b[n])
end
end
fastpow! (generic function with 1 method)

julia> @benchmark fastpow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     617.977 ns (0.00% GC)
median time:      619.324 ns (0.00% GC)
mean time:        629.481 ns (0.00% GC)
maximum time:     829.486 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     173
``````

I’d be worried that this could lead to convergence issues if you use these functions for calculating log densities and gradients for Hamiltonian Monte Carlo. Anyone know the level of accuracy required?
For now, I’ll stick with SLEEF/XSIMD, which still performed quite well and are accurate over a reasonable range.
I’ve been bitten by single precision before, when many Cholesky factorizations that were fine with double precision failed when I used single (under the idea vectorized code would run twice as fast), so I’m weary of going single-precision.

EDIT:
A look at the accuracy. I defined:

``````using Statistics
a = rand(1000);
b = rand(1000);
c = zeros(1000);
const C_EXACT = @. Float64( big(a) ^ big(b) );
mean_exact(c) = mean(c .== C_EXACT)
mean_approx(c) = mean(c .≈ C_EXACT)
function ulp_diff(x, y)
x, y = minmax(x, y)
est = round(Int, (y - x) / eps(x) )
while true
z = nextfloat(x, est)
z == y && return est
est += round(Int, (z - y) / eps(z))
end
end
mean_error(c) = mean(ulp_diff.(c, C_EXACT))
``````

The `mean_error` function gives the number of floating points numbers by which they’re off.

Now, rerunning the benchmarks, and comparing the accuracy with the rounded `BigFloat` answers (over an admittedly very narrow subset of `pow`'s domain):

``````julia> @benchmark directexp!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     13.715 μs (0.00% GC)
median time:      14.199 μs (0.00% GC)
mean time:        14.338 μs (0.00% GC)
maximum time:     227.472 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> mean_exact(c), mean_approx(c), mean_error(c)
(1.0, 1.0, 0.0)

julia> @benchmark cobexp!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     18.915 μs (0.00% GC)
median time:      19.161 μs (0.00% GC)
mean time:        19.280 μs (0.00% GC)
maximum time:     251.692 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.815, 1.0, 0.201)

julia> @benchmark syspow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     13.079 μs (0.00% GC)
median time:      13.146 μs (0.00% GC)
mean time:        13.234 μs (0.00% GC)
maximum time:     30.236 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

julia> mean_exact(c), mean_approx(c), mean_error(c)
(1.0, 1.0, 0.0)

julia> @benchmark jsleefpow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     6.635 μs (0.00% GC)
median time:      6.651 μs (0.00% GC)
mean time:        6.675 μs (0.00% GC)
maximum time:     10.282 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     5

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.828, 1.0, 0.19)

julia> @benchmark csleefpow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     6.036 μs (0.00% GC)
median time:      6.063 μs (0.00% GC)
mean time:        6.101 μs (0.00% GC)
maximum time:     9.794 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     5

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.984, 1.0, 0.016)

julia> @benchmark xsimdpow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     4.101 μs (0.00% GC)
median time:      4.190 μs (0.00% GC)
mean time:        4.182 μs (0.00% GC)
maximum time:     6.699 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     7

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.69, 1.0, 0.325)

julia> @benchmark jsleefpowcob!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.440 μs (0.00% GC)
median time:      2.449 μs (0.00% GC)
mean time:        2.467 μs (0.00% GC)
maximum time:     4.405 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     9

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.673, 1.0, 0.35)

julia> @benchmark csleefpowcob!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.413 μs (0.00% GC)
median time:      2.423 μs (0.00% GC)
mean time:        2.427 μs (0.00% GC)
maximum time:     4.557 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     9

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.787, 1.0, 0.246)

julia> @benchmark xsimdpowcob!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.647 μs (0.00% GC)
median time:      3.695 μs (0.00% GC)
mean time:        3.690 μs (0.00% GC)
maximum time:     5.989 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     8

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.69, 1.0, 0.325)

julia> @benchmark fastpow!(\$c, \$a, \$b)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     617.277 ns (0.00% GC)
median time:      644.266 ns (0.00% GC)
mean time:        644.124 ns (0.00% GC)
maximum time:     1.737 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     173

julia> mean_exact(c), mean_approx(c), mean_error(c)
ERROR: InexactError: trunc(Int64, NaN)
Stacktrace:
 trunc at ./float.jl:696 [inlined]
 round at ./float.jl:361 [inlined]
 ulp_diff(::Float64, ::Float64) at ./REPL:7
 macro expansion at ./broadcast.jl:888 [inlined]
 macro expansion at ./simdloop.jl:77 [inlined]
The system `pow` (called using `ccall(:pow,...)`) and Julia’s default were the most numerically accurate. Unsurprisingly, `exp(b * log(a))` was generally less accurate than `pow`.