Do you have a reference for which algo this implements versus what Base uses? Or is it the same thing just tweaking the Julia implementation to be faster?
Most of the speed difference is that these algorithms only have 1 branch in the normal case. (The branch is for answers with subnormal, 0, Inf or Nan outputs). This both speeds things up, but also allows for automatic vectorization.
For Float32, the algorithm is pretty similar to base, except it uses a 7 term polynomial instead of a much shorter rational function.
The Float64 version is fairly different. It uses a table with 256 elements, to reduce the range the polynomial kernel needs to approximate. This allows a much smaller polynomial to maintain full accuracy.
The Float64 version also has a bunch of minor tricks to get full precision with minimal use of double-double arithmetic. The two main tricks are storing extra bits in the table, and having the kernel approximate expm1 instead of exp.
Can you share how the coefficients for the expm1b_kernel have been chosen? For the ::Val{ℯ} case, I can recognize off-hand that those aren’t quite the Taylor series coefficients, so I’m guessing they’ve been numerically optimized in some way.
(I’ve been silently following the Julia PR, and I got sucked into trying to understand where all the constants were coming from. So far I have not been able to get a minimax optimization to reproduce those specific coefficients, so I’m curious if there’s some extra constraints I haven’t been able to reverse engineer yet.)
The coefficients are minimax polynomials obtained by Remez.jl. The thing that’s a little tricky is that since I want the first coefficient to be 0, I actually approximate x -> x == 0 ? one(x) : expm1(x)/x. Then to make it so the error minimization is absolute error with respect to expm1, you need to weight the error function by a factor of x. Thus the full way to generate the weights (for ::Val{ℯ}) ends up being
One thing worth looking into (especially for Float32) is if rounding these coefficients means that the error they generate increases significantly. If so, it might be worth doing an exhaustive search of nearby coefficients to see if any combination gives better results.
OK, thank you for the clarification. I see now that given your weight function, that’s a minimax of the absolute error — I’d been playing around with minimaxing the relative error (so my weight function was x -> iszero(x) ? one(x) : x / expm1(x)). I also coded up my own minimax algorithm (as a way to teach myself the principles), so I’m guessing part of the remaining residual difference is just some implementation details on my part.
They don’t give exactly the same results, but the Float64 versions have a maximum error of .76 ULPs compared to .88 ULP for base. For Float32, my versions have a max error of 1 ULP compared to .9 for base.
Does interval arithmetic provide an easy way to get the maximum error of a polynomial (in ULPs)? If so, how would I do it? I’m thinking of trying an exhaustive search for better polynomials than the rounded minimax polynomials, but exhaustive checking is too slow.
Do you mean get a faster method for expm1(x)? I don’t have one done yet, but plan to. The reason expm1b kernel isn’t good enough is that it’s designed to only be accurate enough when you add 1, which wipes out a lot of the least significant bits (at least 10). Since expm1 has outputs close to 0, it is much harder to approximate within 1 ULP.