Weird ldexp performance

I’m trying to write a faster implimentation of `exp`, and along the way, I noticed that I have no idea how fast `ldexp` is, and my efforts to benchmark it result in madness.

My code is

``````function myexp(x::Float64)
# Reduction: x = M*ln2 + ln2*(j/1024) + r
# M, j chosen such that 0<1024, abs(r) < ln2/2048

N = round(Int, x*1477.3197218702985)      # 1024/ln2
r = x - N*0.0006769015435155716           # ln2/1024
M = N >> 10
j = N - M<<10

# exp(x) = exp(ln2*M + ln2*j/2^k + r)
small_part = J_TABLE[j+1] * small_exp(r)
return ldexp(small_part, M)
end
``````

Unfortunately, this is about 2x slower than if the last line is replaced with `return small_part* (1<<M)`. When I saw this, everything made sense. I then tried to benchmark the performance of `ldexp` by itself, and discovered that the function, no matter how I timed it (with BenchmarkTools) only took 15ns. Any ideas what’s happening here?

1 Like

Why is 15ns madness? That’s about what I’d expect. Can you give an MWE and show your benchmarking code?

Actually I get

``````julia> x = 1.0; y = 3
3

julia> @btime ldexp(\$(Ref(x))[], \$(Ref(y))[])
3.272 ns (0 allocations: 0 bytes)
8.0
``````
1 Like

I’d stick with this. Compare the native code.
But it is possible to do even better if you can guarantee that `M` belongs to a narrow range (such as `0 <= M < 64`):

``````julia> myldexp(x, y) = x * (1 << y)
myldexp (generic function with 1 method)

julia> @code_native debuginfo=:none myldexp(0.125, 3)
.text
movl    \$1, %eax
shlxq   %rdi, %rax, %rcx
xorl    %edx, %edx
cmpq    \$63, %rdi
cmovaq  %rdx, %rcx
movq    %rdi, %rsi
negq    %rsi
shrxq   %rsi, %rax, %rax
cmpq    \$63, %rsi
cmovaq  %rdx, %rax
testq   %rdi, %rdi
cmovnsq %rcx, %rax
vcvtsi2sd       %rax, %xmm1, %xmm1
vmulsd  %xmm0, %xmm1, %xmm0
retq
nopl    (%rax,%rax)

julia> using SIMDPirates

julia> myldexp(x, y) = (SIMDPirates.assume(0 ≤ y < 64); x * (1 << y))
myldexp (generic function with 1 method)

julia> @code_native debuginfo=:none myldexp(0.125, 3)
.text
movl    \$1, %eax
shlxq   %rdi, %rax, %rax
vcvtsi2sd       %rax, %xmm1, %xmm1
vmulsd  %xmm0, %xmm1, %xmm0
retq
nopw    %cs:(%rax,%rax)
nop
``````

But I think LLVM is good at figuring out restricted ranges on its own, so odds are `assume` is unnecessary.

My bench-marking code is as follows:

``````x = rand(1000);
@btime myexp.(\$x)
``````

I find that the ldexp version takes 6.1us, vs 3.0us for the bit-shifted version. For the actuall problem, I don’t have any restrictions on the range of M, but if it helps, I know that `small_part` is around `1`.