Weird ldexp performance

I’m trying to write a faster implimentation of exp, and along the way, I noticed that I have no idea how fast ldexp is, and my efforts to benchmark it result in madness.

My code is

function myexp(x::Float64)
    # Reduction: x = M*ln2 + ln2*(j/1024) + r
    # M, j chosen such that 0<1024, abs(r) < ln2/2048
    N = round(Int, x*1477.3197218702985)      # 1024/ln2
    r = x - N*0.0006769015435155716           # ln2/1024
    M = N >> 10
    j = N - M<<10
    # exp(x) = exp(ln2*M + ln2*j/2^k + r)
    small_part = J_TABLE[j+1] * small_exp(r)
    return ldexp(small_part, M)

Unfortunately, this is about 2x slower than if the last line is replaced with return small_part* (1<<M). When I saw this, everything made sense. I then tried to benchmark the performance of ldexp by itself, and discovered that the function, no matter how I timed it (with BenchmarkTools) only took 15ns. Any ideas what’s happening here?

1 Like

Why is 15ns madness? That’s about what I’d expect. Can you give an MWE and show your benchmarking code?

Actually I get

julia> x = 1.0; y = 3

julia> @btime ldexp($(Ref(x))[], $(Ref(y))[])
  3.272 ns (0 allocations: 0 bytes)
1 Like

I’d stick with this. Compare the native code.
But it is possible to do even better if you can guarantee that M belongs to a narrow range (such as 0 <= M < 64):

julia> myldexp(x, y) = x * (1 << y)
myldexp (generic function with 1 method)

julia> @code_native debuginfo=:none myldexp(0.125, 3)
        movl    $1, %eax
        shlxq   %rdi, %rax, %rcx
        xorl    %edx, %edx
        cmpq    $63, %rdi
        cmovaq  %rdx, %rcx
        movq    %rdi, %rsi
        negq    %rsi
        shrxq   %rsi, %rax, %rax
        cmpq    $63, %rsi
        cmovaq  %rdx, %rax
        testq   %rdi, %rdi
        cmovnsq %rcx, %rax
        vcvtsi2sd       %rax, %xmm1, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        nopl    (%rax,%rax)

julia> using SIMDPirates

julia> myldexp(x, y) = (SIMDPirates.assume(0 ≤ y < 64); x * (1 << y))
myldexp (generic function with 1 method)

julia> @code_native debuginfo=:none myldexp(0.125, 3)
        movl    $1, %eax
        shlxq   %rdi, %rax, %rax
        vcvtsi2sd       %rax, %xmm1, %xmm1
        vmulsd  %xmm0, %xmm1, %xmm0
        nopw    %cs:(%rax,%rax)

But I think LLVM is good at figuring out restricted ranges on its own, so odds are assume is unnecessary.

My bench-marking code is as follows:

x = rand(1000);
@btime myexp.($x)

I find that the ldexp version takes 6.1us, vs 3.0us for the bit-shifted version. For the actuall problem, I don’t have any restrictions on the range of M, but if it helps, I know that small_part is around 1.