Slow arbitrary base exponentiation, a^b

Hey everyone, I’ve been having fun optimising my code and I’ve noticed that raising a Float64 to the power of another Float64 (e.g. a^b) is very slow compared to base e exponentiation using the exp() function, even to the point where if I use change of base and evaluate exp(log(a)*b), I still get a considerable speed up, even with the overhead of evaluating log(a). I have some example code and benchmarks below.

using BenchmarkTools

a = rand(1000)
b = rand(1000)
c = zeros(1000)

function directexp(a,b,c)
  for i in 1:length(c)
    c[i] = a[i]^b[i]
  end
end

function cobexp(a,b,c)
  for i in 1:length(c)
    c[i] = exp(b[i]*log(a[i]))
  end
end

function eulerexp(a,b,c)
  for i in 1:length(c)
    c[i] = ℯ^b[i]
    # c[i] = exp(b[i])
  end
end

@btime directexp(a,b,c) # 68.369 μs (0 allocations: 0 bytes)
@btime cobexp(a,b,c)    # 17.991 μs (0 allocations: 0 bytes)
@btime eulerexp(a,b,c)  #  7.197 μs (0 allocations: 0 bytes)

Where the result from directexp() and cobexp() only differ in the last decimal place.

Now, I understand that obviously floating point exponentiation is an expensive operation, and it’s clear that Julia has a highly optimised implementation for base e exponentiation, but why is arbitrary base exponentiation so much slower?

On my computer directexp is about 25% faster, not 3x slower. (Julia 1.1, intel mac). What’s your setup?

Edit: some numbers, what does seem odd is that cobexp is slower than the sum of log and exp alone. Some kind of SIMD magic?

julia> @btime directexp(a,b,c); # all with @inbounds
  15.272 μs (0 allocations: 0 bytes)

julia> @btime cobexp(a,b,c);
  19.130 μs (0 allocations: 0 bytes)

julia> @btime eulerexp(a,b,c);
  6.841 μs (0 allocations: 0 bytes)

julia> @btime baseexp(a,b,c); # literally exp
  6.932 μs (0 allocations: 0 bytes)

julia> @btime baselog(a,b,c); # log alone
  5.293 μs (0 allocations: 0 bytes)

julia> function cobexp4(a,b,c)
         for i in 1:length(c)
           @inbounds c[i] = log(a[i])
         end
         for i in 1:length(c)
           @inbounds c[i] = exp(b[i] * c[i])
         end
         c
       end
cobexp4 (generic function with 1 method)

julia> @btime cobexp4(a,b,c);
  12.375 μs (0 allocations: 0 bytes)

my timings:

directexp: 14.412 μs
cobexp: 19.790 μs
eulerexp: 6.654 μs

intel 7700k @4.6ghz

That’s really weird. I get I also get

directexp: 71.476 μs
cobexp: 23.339 μs
eulerexp: 7.749 μs

on my home computer. Both are running Julia 1.1.1 from Atom/Juno, on windows, one with an intel 7700HQ, the other with an intel 8550U. Maybe it’s a windows issue?

EDIT:
Evaluating on JuliaBox gives me

# 0.6.4 Kernel, using the c[i] = exp(b[i]) line in eulerexp
directexp:  52.800 μs (0 allocations: 0 bytes)
cobexp:     29.800 μs (0 allocations: 0 bytes)
eulerexp:   10.700 μs (0 allocations: 0 bytes)

# 1.0.4 Kernel
directexp:  58.400 μs (0 allocations: 0 bytes)
cobexp:     23.100 μs (0 allocations: 0 bytes)
eulerexp:   10.000 μs (0 allocations: 0 bytes)

The trends also agree when using @threads on both my computers, @fastmath, in my actual code with more complicated expressions, and scales linearly with length(c). I even used lengths big enough to time with a stopwatch and can confirm that the timings are accurate and representative.

Linux machine:

julia> @btime directexp(a,b,c)
  55.856 μs (0 allocations: 0 bytes)

julia> @btime cobexp(a,b,c)
  24.498 μs (0 allocations: 0 bytes)

julia> @btime eulerexp(a,b,c)
  8.914 μs (0 allocations: 0 bytes)

Windows machine:

julia> @btime directexp(a,b,c)
  72.101 μs (0 allocations: 0 bytes)

julia> @btime cobexp(a,b,c) 
  23.900 μs (0 allocations: 0 bytes)

julia> @btime eulerexp(a,b,c)
  8.267 μs (0 allocations: 0 bytes)

(Julia 1.1)

MacOS:

jl> @btime directexp($a,$b,$c)
  24.212 μs (0 allocations: 0 bytes)

jl> @btime cobexp($a,$b,$c)
  28.042 μs (0 allocations: 0 bytes)

jl> @btime eulerexp($a,$b,$c)  
  9.911 μs (0 allocations: 0 bytes)

It looks like Julia just calls llvm.pow.f64 for ^, so maybe it depends on how llvm was compiled.

It’s fast for me (Julia and LLVM compiled from source with -march=native):

julia> @btime directexp($a,$b,$c)
  13.692 μs (0 allocations: 0 bytes)

julia> @btime cobexp($a,$b,$c)
  20.767 μs (0 allocations: 0 bytes)

julia> @btime eulerexp($a,$b,$c)
  5.445 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.3.0-DEV.395
Commit b7774095f6 (2019-06-11 21:06 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.0 (ORCJIT, skylake)

julia> @code_native directexp(a,b,c)
        .text
; ┌ @ REPL[48]:2 within `directexp'
        pushq   %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        pushq   %rax
        movq    %rsi, (%rsp)
        movq    16(%rsi), %r14
; │┌ @ array.jl:200 within `length'
        movq    8(%r14), %r15
; │└
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `UnitRange'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
        testq   %r15, %r15
; │└└└└└
        jle     L85
        movq    (%rsi), %r12
        movq    8(%rsi), %r13
        xorl    %ebx, %ebx
        movabsq $pow, %rbp
        nop
; │ @ REPL[48]:3 within `directexp'
; │┌ @ array.jl:728 within `getindex'
L48:
        movq    (%r12), %rax
        vmovsd  (%rax,%rbx,8), %xmm0    # xmm0 = mem[0],zero
        movq    (%r13), %rax
        vmovsd  (%rax,%rbx,8), %xmm1    # xmm1 = mem[0],zero
; │└
; │┌ @ math.jl:793 within `^'
        callq   *%rbp
; │└
; │┌ @ array.jl:766 within `setindex!'
        movq    (%r14), %rax
        vmovsd  %xmm0, (%rax,%rbx,8)
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
        addq    $1, %rbx
        cmpq    %rbx, %r15
; │└└
        jne     L48
L85:
        movabsq $jl_system_image_data, %rax
; │ @ REPL[48]:3 within `directexp'
        addq    $8, %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
        nop
; └

3 Likes

Yeah, I saw that it was using a more or less direct LLVM call, but didn’t really know enough about it to go any further. It says I’m using libLLVM-6.0.1 (ORCJIT, skylake), so maybe it’s just the LLVM version difference?

EDIT:
I also noticed that log(x) and exp(x) are implemented with their own specialised algorithms in julia, rather than directly calling an LLVM function, which may explain why cobexp() and eulerexp() have been so consistent between people.

BTW, if you don’t need perfect accuracy, your function will run in 3.3μs and 6.1μs for Float32 and Float64 respectively using fastpow below. They’re pure Julia, so they should be fast regardless of LLVM.

# https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastlog.h
function fastlog2(x::Float32)::Float32
    y = Float32(reinterpret(Int32, x))
    y *= 1.1920928955078125f-7
    y - 126.94269504f0
end
function fastlog2(x::Float64)::Float32
   fastlog2(Float32(x))
end

# https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastexp.h
function fastpow2(x::Float32)::Float32
    clipp = x < -126.0f0 ? -126.0f0 : x
    clipp = min(126f0, max(-126f0, x))
    reinterpret(Float32, UInt32((1 << 23) * (clipp + 126.94269504f0)))
end
function fastpow2(x::Float64)::Float32
   fastpow2(Float32(x))
end

# https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastpow.h
function fastpow(x::Real, y::Real)::Real
    fastpow2(y * fastlog2(x))
end
1 Like

In economics it’s common for a large fraction of runtime to be spent calling the pow function (Ex: CRRA utility) so it would be great if there was a consistent way for it to be fast. Is compiling from source the way to go? Would a native sysimg suffice?

FWIW My Julia is compiled from github source

Thanks Rob, that’s been really helpful. I don’t have admin privileges on my work computer, so I doubt it’s gonna be convenient to install all the dependencies required to build Julia from source here, but I’ll give it a try on my home computer and report back if it makes a difference.

In practice the fastpow function is just on the border of being accurate enough, so I’ll have a play around with that (maybe compromise by using exp2(y*fastlog2(x)) instead of fastpow2(y*fastlog2(x)), which seems to be good enough for my purposes). In particular, I’d like my code to be reasonably portable, so the version/compilation dependence of the llvm.pow.f64 isn’t great for that, even if I get it running faster on my machine.

It does seem strange that such a fundamental operation would be so variable between installations, as I’d imagine quite a lot of codes would rely on performant exponentiation, which seems to be varying by a factor of ~5 between people.

In case you didn’t notice already, the references I noted in my prior post have two versions of the approximations, e.g., fastpow and fasterpow. Despite the naming in my Julia translation, I implemented the faster approximations, so if you’d like a little more accuracy, you could try the original fast versions.

1 Like

I am now working on a GE model with a lot of ^ in the equations (CES aggregator, production functions, CRRA utility; solution method is value iteration [because of finite lifetimes], then rootfinding for market clearing), so I was curious about this: according to profiling, apparently in my code about 9-12% is spent on ^. The code is otherwise heavily optimized.

It is of course always great to have things run faster, but I am not sure this is really an outstanding bottleneck. YMMV.

1 Like

I’m curious what do you get on the benchmark in the original post? For me directexp was several times slower than the others, but other people seem to have better optimization.

For a starker example, if I use the code available here https://github.com/jesusfv/Comparison-Programming-Languages-Economics/blob/master/RBC_Julia.jl but change the utility from log to CRRA, my timing goes from 1.3s to 5.5s. The number of iterations is about the same.

In a more complex model like yours it wouldn’t be as much of a bottleneck.

You’re not allowed to do git clone ... into your home folder? Julia downloads all depencies you need while compiling, AFAIK the only thing you really need is a compiler. If you’re working on a Windows, I don’t know how that works out

1 Like

I am getting

  20.319 μs (0 allocations: 0 bytes)
  26.976 μs (0 allocations: 0 bytes)
  8.874 μs (0 allocations: 0 bytes)

on a pre-1.2.

But I am not convinced that eulerexp is comparable to the other two, it is not using the a. Correcting that,

julia> @btime eulerexp(a,b,c)
  27.095 μs (0 allocations: 0 bytes)

so after much ado, both options would make my code slower.

1 Like

Other options…

julia> @noinline ThrowLengthMismatchError() = throw("Lengths aren't equal!")
ThrowLengthMismatchError (generic function with 1 method)

julia> function directexp!(c,a,b)
           N = length(a)
           N == length(b) == length(c) || ThrowLengthMismatchError()
           @inbounds for n in 1:N
               c[n] = a[n]^b[n]
           end
       end
directexp! (generic function with 1 method)

julia> function cobexp!(c,a,b)
           N = length(a)
           N == length(b) == length(c) || ThrowLengthMismatchError()
           @inbounds for n in 1:N
               c[n] = exp(b[n]*log(a[n]))
           end
       end
cobexp! (generic function with 1 method)

julia> syspower(a, b) = ccall(:pow,Float64,(Float64,Float64),a,b)
syspower (generic function with 1 method)

julia> function syspow!(c,a,b)
           N = length(a)
           N == length(b) == length(c) || ThrowLengthMismatchError()
           @inbounds for n in 1:N
               c[n] = syspower(a[n], b[n])
           end
       end
syspow! (generic function with 1 method)

julia> using SIMDPirates, LoopVectorization
[ Info: Recompiling stale cache file /home/chriselrod/.julia/compiled/v1.3/LoopVectorization/4TogI.ji for LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> using SLEEFPirates, SLEEFwrap, xsimdwrap
[ Info: Recompiling stale cache file /home/chriselrod/.julia/compiled/v1.3/SLEEFwrap/GD9JX.ji for SLEEFwrap [6e748a96-c50b-11e8-2c8a-ed394f64ea02]
[ Info: Recompiling stale cache file /home/chriselrod/.julia/compiled/v1.3/xsimdwrap/NPDTn.ji for xsimdwrap [b9b7a223-629f-5757-a0fe-5fc90c39777c]

julia> # requires unregistered dependencies:
       # https://github.com/chriselrod/VectorizationBase.jl#master
       # https://github.com/chriselrod/SIMDPirates.jl#master
       # https://github.com/chriselrod/LoopVectorization.jl#master
       # https://github.com/chriselrod/SLEEFPirates.jl#master
       # https://github.com/chriselrod/SLEEFwrap.jl#master
       # https://github.com/chriselrod/xsimdwrap.jl#master


       @generated function jsleefpow!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T}
           quote
               N = length(a)
               N == length(b) == length(c) || ThrowLengthMismatchError()
               @vectorize $T for n in 1:N
                   c[n] = a[n] ^ b[n]
               end
           end
       end
jsleefpow! (generic function with 1 method)

julia> @generated function csleefpow!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Union{Float32,Float64}}
           quote
               N = length(a)
               N == length(b) == length(c) || ThrowLengthMismatchError()
               @cvectorize $T for n in 1:N
                   c[n] = a[n] ^ b[n]
               end
           end
       end
csleefpow! (generic function with 1 method)

julia> @generated function xsimdpow!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Union{Float32,Float64}}
           quote
               N = length(a)
               N == length(b) == length(c) || ThrowLengthMismatchError()
               @xvectorize $T for n in 1:N
                   c[n] = a[n] ^ b[n]
               end
           end
       end
xsimdpow! (generic function with 1 method)

julia> @generated function jsleefpowcob!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T}
           quote
               N = length(a)
               N == length(b) == length(c) || ThrowLengthMismatchError()
               @vectorize $T for n in 1:N
                   c[n] = exp(b[n] * log(a[n]))
               end
           end
       end
jsleefpowcob! (generic function with 1 method)

julia> @generated function csleefpowcob!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Union{Float32,Float64}}
           quote
               N = length(a)
               N == length(b) == length(c) || ThrowLengthMismatchError()
               @cvectorize $T for n in 1:N
                   c[n] = exp(b[n] * log(a[n]))
               end
           end
       end
csleefpowcob! (generic function with 1 method)

julia> @generated function xsimdpowcob!(c::AbstractArray{T}, a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Union{Float32,Float64}}
           quote
               N = length(a)
               N == length(b) == length(c) || ThrowLengthMismatchError()
               @xvectorize $T for n in 1:N
                   c[n] = exp(b[n] * log(a[n]))
               end
           end
       end
xsimdpowcob! (generic function with 1 method)

julia> a = rand(1000);

julia> b = rand(1000);

julia> c = zeros(1000);

julia> using BenchmarkTools

julia> @benchmark directexp!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.081 μs (0.00% GC)
  median time:      14.204 μs (0.00% GC)
  mean time:        14.240 μs (0.00% GC)
  maximum time:     24.483 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark cobexp!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.695 μs (0.00% GC)
  median time:      19.928 μs (0.00% GC)
  mean time:        19.970 μs (0.00% GC)
  maximum time:     30.733 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark syspow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.908 μs (0.00% GC)
  median time:      12.998 μs (0.00% GC)
  mean time:        13.031 μs (0.00% GC)
  maximum time:     23.834 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark jsleefpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.537 μs (0.00% GC)
  median time:      6.562 μs (0.00% GC)
  mean time:        6.633 μs (0.00% GC)
  maximum time:     11.406 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark csleefpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.052 μs (0.00% GC)
  median time:      6.094 μs (0.00% GC)
  mean time:        6.112 μs (0.00% GC)
  maximum time:     9.819 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark xsimdpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.179 μs (0.00% GC)
  median time:      4.191 μs (0.00% GC)
  mean time:        4.202 μs (0.00% GC)
  maximum time:     6.713 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark jsleefpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.442 μs (0.00% GC)
  median time:      2.480 μs (0.00% GC)
  mean time:        2.481 μs (0.00% GC)
  maximum time:     22.302 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark csleefpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.413 μs (0.00% GC)
  median time:      2.422 μs (0.00% GC)
  mean time:        2.435 μs (0.00% GC)
  maximum time:     4.540 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark xsimdpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.735 μs (0.00% GC)
  median time:      3.746 μs (0.00% GC)
  mean time:        3.755 μs (0.00% GC)
  maximum time:     5.963 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8


All times above were in microseconds.
My best times were through exp(b * log(a)) with SLEEF, and my worst through Base.exp(b * Base.exp(a)). My Base.:^ was faster. I also compiled Julia from source, with MARCH=native in my Make.user.

The vectorized functions were naturally the fastest. The clear winner, if you can stomach the loss of precision, is @robsmith11’s version (which vectorizes after replacing a UInt32(...) with unsafe_trunc(UInt32, ...).

julia> # https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastlog.h
       function fastlog2(x::Float32)::Float32
           y = Float32(reinterpret(Int32, x))
           y *= 1.1920928955078125f-7
           y - 126.94269504f0
       end
fastlog2 (generic function with 1 method)

julia> function fastlog2(x::Float64)::Float32
          fastlog2(Float32(x))
       end
fastlog2 (generic function with 2 methods)

julia> # https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastexp.h
       function fastpow2(x::Float32)::Float32
           clipp = x < -126.0f0 ? -126.0f0 : x
           clipp = min(126f0, max(-126f0, x))
           reinterpret(Float32, Base.unsafe_trunc(UInt32, (1 << 23) * (clipp + 126.94269504f0)))
       end
fastpow2 (generic function with 1 method)

julia> function fastpow2(x::Float64)::Float32
          fastpow2(Float32(x))
       end
fastpow2 (generic function with 2 methods)

julia> # https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastpow.h
       function fastpow(x::Real, y::Real)::Real
           fastpow2(y * fastlog2(x))
       end
fastpow (generic function with 1 method)

julia> function fastpow!(c,a,b)
           N = length(a)
           N == length(b) == length(c) || ThrowLengthMismatchError()
           @inbounds @simd ivdep for n in 1:N
               c[n] = fastpow(a[n], b[n])
           end
       end
fastpow! (generic function with 1 method)

julia> @benchmark fastpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     617.977 ns (0.00% GC)
  median time:      619.324 ns (0.00% GC)
  mean time:        629.481 ns (0.00% GC)
  maximum time:     829.486 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     173

I’d be worried that this could lead to convergence issues if you use these functions for calculating log densities and gradients for Hamiltonian Monte Carlo. Anyone know the level of accuracy required?
For now, I’ll stick with SLEEF/XSIMD, which still performed quite well and are accurate over a reasonable range.
I’ve been bitten by single precision before, when many Cholesky factorizations that were fine with double precision failed when I used single (under the idea vectorized code would run twice as fast), so I’m weary of going single-precision.

EDIT:
A look at the accuracy. I defined:

using Statistics
a = rand(1000);
b = rand(1000);
c = zeros(1000);
const C_EXACT = @. Float64( big(a) ^ big(b) );
mean_exact(c) = mean(c .== C_EXACT)
mean_approx(c) = mean(c .≈ C_EXACT)
function ulp_diff(x, y)
    x, y = minmax(x, y)
    est = round(Int, (y - x) / eps(x) )
    while true
        z = nextfloat(x, est)
        z == y && return est
        est += round(Int, (z - y) / eps(z))
    end
end
mean_error(c) = mean(ulp_diff.(c, C_EXACT))

The mean_error function gives the number of floating points numbers by which they’re off.

Now, rerunning the benchmarks, and comparing the accuracy with the rounded BigFloat answers (over an admittedly very narrow subset of pow's domain):

julia> @benchmark directexp!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.715 μs (0.00% GC)
  median time:      14.199 μs (0.00% GC)
  mean time:        14.338 μs (0.00% GC)
  maximum time:     227.472 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> mean_exact(c), mean_approx(c), mean_error(c)
(1.0, 1.0, 0.0)

julia> @benchmark cobexp!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.915 μs (0.00% GC)
  median time:      19.161 μs (0.00% GC)
  mean time:        19.280 μs (0.00% GC)
  maximum time:     251.692 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.815, 1.0, 0.201)

julia> @benchmark syspow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.079 μs (0.00% GC)
  median time:      13.146 μs (0.00% GC)
  mean time:        13.234 μs (0.00% GC)
  maximum time:     30.236 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> mean_exact(c), mean_approx(c), mean_error(c)
(1.0, 1.0, 0.0)

julia> @benchmark jsleefpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.635 μs (0.00% GC)
  median time:      6.651 μs (0.00% GC)
  mean time:        6.675 μs (0.00% GC)
  maximum time:     10.282 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.828, 1.0, 0.19)

julia> @benchmark csleefpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.036 μs (0.00% GC)
  median time:      6.063 μs (0.00% GC)
  mean time:        6.101 μs (0.00% GC)
  maximum time:     9.794 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.984, 1.0, 0.016)

julia> @benchmark xsimdpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.101 μs (0.00% GC)
  median time:      4.190 μs (0.00% GC)
  mean time:        4.182 μs (0.00% GC)
  maximum time:     6.699 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.69, 1.0, 0.325)

julia> @benchmark jsleefpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.440 μs (0.00% GC)
  median time:      2.449 μs (0.00% GC)
  mean time:        2.467 μs (0.00% GC)
  maximum time:     4.405 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.673, 1.0, 0.35)

julia> @benchmark csleefpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.413 μs (0.00% GC)
  median time:      2.423 μs (0.00% GC)
  mean time:        2.427 μs (0.00% GC)
  maximum time:     4.557 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.787, 1.0, 0.246)

julia> @benchmark xsimdpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.647 μs (0.00% GC)
  median time:      3.695 μs (0.00% GC)
  mean time:        3.690 μs (0.00% GC)
  maximum time:     5.989 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> mean_exact(c), mean_approx(c), mean_error(c)
(0.69, 1.0, 0.325)

julia> @benchmark fastpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     617.277 ns (0.00% GC)
  median time:      644.266 ns (0.00% GC)
  mean time:        644.124 ns (0.00% GC)
  maximum time:     1.737 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     173

julia> mean_exact(c), mean_approx(c), mean_error(c)
ERROR: InexactError: trunc(Int64, NaN)
Stacktrace:
 [1] trunc at ./float.jl:696 [inlined]
 [2] round at ./float.jl:361 [inlined]
 [3] ulp_diff(::Float64, ::Float64) at ./REPL[61]:7
 [4] _broadcast_getindex_evalf at ./broadcast.jl:625 [inlined]
 [5] _broadcast_getindex at ./broadcast.jl:598 [inlined]
 [6] getindex at ./broadcast.jl:558 [inlined]
 [7] macro expansion at ./broadcast.jl:888 [inlined]
 [8] macro expansion at ./simdloop.jl:77 [inlined]
 [9] copyto! at ./broadcast.jl:887 [inlined]
 [10] copyto! at ./broadcast.jl:842 [inlined]
 [11] copy at ./broadcast.jl:818 [inlined]
 [12] materialize at ./broadcast.jl:798 [inlined]
 [13] mean_error(::Array{Float64,1}) at ./REPL[62]:1
 [14] top-level scope at REPL[84]:1

julia> mean_exact(c), mean_approx(c)
(0.0, 0.0)

The system pow (called using ccall(:pow,...)) and Julia’s default were the most numerically accurate. Unsurprisingly, exp(b * log(a)) was generally less accurate than pow.

4 Likes