Difference between f(1.0e-7) & f(1.0e-8)

The limit of the following function at 0 is 1:

f(x) = 2*(1 - x*exp(-x) - exp(-x)) / (1 - exp(-x))^2

yet f(1.0e-8) gives 0 while f(1.0e-7) gives the correct answer. Why?

julia> f(x)=2*(1-x*exp(-x)-exp(-x))/(1-exp(-x))^2
f (generic function with 1 method)

julia> f(1.0e-7)

julia> f(1.0e-8)

The problem is catastrophic cancellation due to the limited precision of double-precision floating-point arithmetic:

julia> 1 - (1e-7) * exp(-1e-7)

julia> exp(-1e-7)

julia> 1 - (1e-7) * exp(-1e-7) - exp(-1e-7)

julia> 1 - (1e-8) * exp(-1e-8)

julia> exp(-1e-8)

julia> 1 - (1e-8) * exp(-1e-8) - exp(-1e-8)

Note also that your f is probably quite inaccurate even for x = 1e-7:

julia> f(big"1e-7")

Naively evaluating textbook formulas involving very small numbers often goes wrong in numerical computing.
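To see the cancellation in isolation, here is a small self-contained sketch (not from the posts above) comparing the naive Float64 evaluation of the numerator against a BigFloat reference; the x^2/2 value comes from the Taylor expansion 1 - x*exp(-x) - exp(-x) = x²/2 - x³/3 + …:

```julia
# Near x = 0, the true value of 1 - x*exp(-x) - exp(-x) is about x^2/2,
# far below the Float64 spacing around 1 (eps() ≈ 2.2e-16), so the
# subtractions wipe out every significant digit.
x = 1.0e-8
naive = 1 - x*exp(-x) - exp(-x)                          # catastrophic cancellation
exact = Float64(1 - big(x)*exp(-big(x)) - exp(-big(x)))  # BigFloat reference
@show naive
@show exact
@show x^2/2
```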


The following implementation computes your g(δ) accurately in double precision, by switching over to a Taylor series for small δ and exploiting the expm1 function for larger δ.

function g(δ::Union{Float64,ComplexF64})
    if abs(δ) < 0.2
        # Taylor series around δ = 0, in odd powers: δ * (polynomial in δ²)
        return 1 + δ*@evalpoly(δ^2, 1/3, -1/90, 1/2520, -1/75600, 1/2395008, -691/54486432000)
    else
        y = expm1(-δ)  # exp(-δ) - 1, computed without cancellation
        return -2(y + δ*exp(-δ)) / y^2
    end
end

(It’s possible that there is some clever algebraic transformation that lets you avoid the explicit Taylor series, but I’m not seeing it at the moment.)
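As a quick consistency check (a sketch; `g_taylor` and `g_expm1` are just illustrative names for the two branches above), the Taylor branch and the expm1 branch should agree to near machine precision around the |δ| = 0.2 switchover:

```julia
# The two branches of g, pulled out separately so they can be compared.
g_taylor(δ) = 1 + δ*@evalpoly(δ^2, 1/3, -1/90, 1/2520, -1/75600, 1/2395008, -691/54486432000)
function g_expm1(δ)
    y = expm1(-δ)  # exp(-δ) - 1, computed without cancellation
    return -2(y + δ*exp(-δ)) / y^2
end

# On either side of the switchover the two formulas track each other closely.
for δ in (0.15, 0.2, 0.25)
    @show (δ, g_taylor(δ), g_expm1(δ))
end
```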


You can also exploit the existence of expm1 and expand each term:

f1(x) = 2*(1 - x*exp(-x) - exp(-x)) / (1 - exp(-x))^2  # for comparison
function f2(x)
    denom = -expm1(-x)  # 1 - exp(-x), computed without cancellation
    return 2/denom + 2*x/denom - 2*x/(denom*denom)
end

Testing for agreement at other numbers:

julia> f1(4)

julia> f2(4)

near zero:

julia> f1(1e-7)

julia> f2(1e-7)

julia> f2(1e-8)

julia> f1(1e-8)

To be fair, it also fails, but only around 1e-11.
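The breakdown is easy to check directly. The cancellation has moved into the sum of the three terms of size ~2/x, so roughly eps()·2/x of absolute error remains. A small self-contained sketch comparing f2 against the leading Taylor behaviour of the exact function, 1 + x/3:

```julia
function f2(x)
    denom = -expm1(-x)  # 1 - exp(-x), computed without cancellation
    return 2/denom + 2*x/denom - 2*x/(denom*denom)
end

# As x shrinks, the three ~2/x-sized terms grow and their sum loses digits.
for x in (1e-7, 1e-9, 1e-11)
    @show (x, f2(x), 1 + x/3)
end
```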


Oftentimes in numerical computing, in the field I was in, we would either rigorously or crudely determine limits for significance/guard digits. Rules of thumb were usually 1e-5 for floats and 1e-10 for doubles (because we used FFTs and the like a lot).

The main takeaway is that there's error in basically every calculation you do. There's also a finite resolution to what a digital object can store and handle. When you do lots of calculations, that error can propagate or do unexpected things.
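A classic toy illustration of that propagation (a sketch, not from this thread): 0.1 is not exactly representable in binary, so repeatedly adding it drifts away from the ideal sum:

```julia
# Sum x to itself n times the naive way, letting rounding error accumulate.
function naive_sum(x, n)
    s = 0.0
    for _ in 1:n
        s += x
    end
    return s
end

s = naive_sum(0.1, 1_000_000)
@show s
@show s - 100_000.0  # the accumulated drift after a million additions
```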


Does this mean that in any serious calculation where human lives are at stake, you need to do the calculation twice: once in Float64 and once in Float32, then compare the results to make sure they come up with roughly the same value, in case there are any numerical instabilities?


Well, I'm not so sure about the answer to that. Another thing that can be done instead would be to plot a range of solutions near the answer, and to test results under known, similar conditions.

I mostly did work like this. When something goes awry you'll be able to tell most of the time; when the real and imaginary parts of complex values become underdetermined, you'll see steppy swaps. Really, though, you can either measure or find the floating-point error for most calculations. Julia is actually amazing for this… I need to write a paper about it, but have negative time to do so. I have tons of results and code for it, though…

When someone has done a careful analysis, this is not necessary—in some cases one can prove that the result has a certain accuracy.


Thanks, all of you, for such detailed explanations and diverse solutions!

Although it is not entirely foolproof, this method is likely to work well in practice. For a rather extensive discussion about such techniques, see for example this paper by Kahan:

The discussion of possible round-off error analysis schemes, and Kahan's opinion of them, starts on page 13.
The first of the five schemes listed in this paper is the one you mention: running the same computation several times with different FP precisions. The second scheme (Kahan’s favorite) is very similar: it consists in running the same computation several times with the same FP precision, but with different rounding modes.
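The first scheme is easy to try on the function from this thread (a sketch; the point is that disagreement between precisions flags the instability, not that either answer is right):

```julia
f(x) = 2*(1 - x*exp(-x) - exp(-x)) / (1 - exp(-x))^2

# The same computation in Float32 and Float64: near 0 the two precisions
# disagree wildly, signalling numerical trouble.
@show f(1f-4)    # Float32 input, Float32 arithmetic
@show f(1.0e-4)  # Float64 input, Float64 arithmetic
```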


Sometimes numerical work may inform decisions and influence actions relevant to another's health, wealth, or contentment. I have found it better to compute more accurately than is attainable using Float32 or Float64, and then convert the likely more accurate result to Float32 or Float64. The choice of computational type and the choice of presentational type are made with an understanding of the specifics of the work and its context.
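In that spirit, a minimal sketch of "compute wide, present narrow" using the built-in BigFloat as the wide type:

```julia
f(x) = 2*(1 - x*exp(-x) - exp(-x)) / (1 - exp(-x))^2

# Evaluate in 256-bit BigFloat, where the cancellation costs ~17 of ~77
# digits and is therefore harmless, then round once for presentation.
wide   = f(big"1e-8")
narrow = Float64(wide)
@show narrow
```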

There are some publicly available Julia packages designed for this work. For your purposes, I suggest becoming familiar with DoubleFloats.jl. Where algorithmic choices, data uncertainties, or natural inexactness suggest characterizing the results as spans, the types ArbReal and ArbComplex from ArbNumerics.jl do just that. Also see the packages from JuliaIntervals.


I saw your talk on this!!!


You can also use TaylorSeries.jl to do Taylor series manipulations:

julia> using TaylorSeries

julia> n = 16
16

julia> t = Taylor1(Rational{Int}, n)
 1//1 t + 𝒪(t¹⁷)

julia> ee = sum( (-t)^i / factorial(i) for i in 0:n )
 1//1 - 1//1 t + 1//2 t² - 1//6 t³ + 1//24 t⁴ - 1//120 t⁵ + 1//720 t⁶ - 1//5040 t⁷ + 1//40320 t⁸ - 1//362880 t⁹ + 1//3628800 t¹⁰ - 1//39916800 t¹¹ + 1//479001600 t¹² - 1//6227020800 t¹³ + 1//87178291200 t¹⁴ - 1//1307674368000 t¹⁵ + 1//20922789888000 t¹⁶ + 𝒪(t¹⁷)

julia> 2 * (1 - t * ee - ee) / ((1 - ee)^2)
 1//1 + 1//3 t - 1//90 t³ + 1//2520 t⁵ - 1//75600 t⁷ + 1//2395008 t⁹ - 691//54486432000 t¹¹ + 1//2668723200 t¹³ - 59513//156920924160000 t¹⁵ - 42397//94152554496000 t¹⁶ + 𝒪(t¹⁷)

(Note that although exp(t) is defined for a Taylor series, it currently converts the arguments to floats.)

You can also bound the remainder using TaylorModels.jl.


That’s a neat writeup of very relevant problems and methods from a giant in the field, but it is surprising how jaded and bitter the tone of the article is.
