PSA: floating-point arithmetic

Sometimes people are surprised by the results of floating-point calculations such as

julia> 5/6                
0.8333333333333334       # shouldn't the last digit be 3?

julia> 2.6 - 0.7 - 1.9   
2.220446049250313e-16    # shouldn't the answer be 0?

These are not bugs in Julia. They’re consequences of the IEEE-standard 64-bit binary representation of floating-point numbers that is burned into computer hardware, which Julia and many other languages use by default.

Brief explanation

You can think of 64-bit floating point numbers like binary scientific notation with 52 bits for the mantissa, 11 bits for the exponent, and 1 bit for the sign. The 52-bit mantissa means that floating-point numbers between 1 and 2 are spaced 2^{-52} \approx 10^{-16} apart. When you type in a decimal number or perform an arithmetic operation, in general the result must be rounded to the nearest floating-point number.
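You can check that spacing directly. Python's floats are the same IEEE 64-bit binary format, so here's a quick sanity check using `math.nextafter` (Python 3.9+):

```python
import math

# The next representable 64-bit float after 1.0 differs from it by 2^-52,
# the spacing of floating-point numbers in the interval [1, 2).
gap = math.nextafter(1.0, 2.0) - 1.0
print(gap)              # 2.220446049250313e-16
assert gap == 2.0**-52
```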

The 5/6 example above has a trailing 4 because the 64-bit float that is closest to 5/6 prints as 0.8333333333333334 when converted back to decimal and printed to all 16 significant digits.

In the 2.6 - 0.7 - 1.9 example, none of those three decimal numbers has an exact 64-bit binary representation, so each must be rounded to the nearest 64-bit float before doing any arithmetic. Further, since the 64-bit floating-point number system is not closed under arithmetic operations, the results of arithmetic operations generally must be rounded to the nearest 64-bit float as well. All these rounding operations produce a result that is close to, but not exactly, zero. The strange-looking number that should be zero is actually 2^{-52}, the spacing between floating-point numbers in the range from 1 to 2.

julia> 2.6 - 0.7 - 1.9
2.220446049250313e-16

julia> 2.0^-52
2.220446049250313e-16
It might seem you could fix these problems by switching to a base-10 floating-point number system. This would eliminate the initial rounding errors when converting from decimal, but the fix would be totally superficial. Most arithmetic operations occur on the results of previous calculations, so any exactness gained by avoiding the initial rounding would be lost after just a few operations. And the cost of doing floating-point arithmetic in decimal, whether in hardware or software, is substantial.
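To see both halves of that argument concretely, Python's standard decimal module implements base-10 floating point: the initial conversion errors vanish, but rounding reappears the moment a result (like 1/3) has no finite decimal expansion:

```python
from decimal import Decimal

# No conversion error: decimal literals are represented exactly in base 10...
print(Decimal("2.6") - Decimal("0.7") - Decimal("1.9"))  # 0.0

# ...but arithmetic must still round: 1/3 has no finite decimal expansion,
# so the default 28-digit context rounds it.
print(Decimal(1) / Decimal(3))  # 0.3333333333333333333333333333
```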

The take-home message is that any finite computer representation of the real number system has finite precision, so wanting exact answers is unrealistic. For 64-bit floating-point numbers, you cannot expect conversions from decimal or arithmetic operations to be accurate to more than 16 digits.

More detail: 5/6

Still glossing over a lot of subtlety (overflow/underflow, subnormals, etc.), a 64-bit float is a number of the form

\pm (1 + m \: 2^{-52}) \: 2^n

where m is an integer between 0 and 2^{52}-1 (occupying 52 bits), n is an integer between -2^{10} and 2^{10}-1 (occupying 11 bits), and the remaining 1 bit is used for the sign.
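If you want to extract m and n for yourself, the decomposition can be read off the raw bits. A sketch in Python (the helper name `decompose` is mine), reinterpreting the double with struct:

```python
import struct

def decompose(x):
    """Split a 64-bit float into (sign bit, exponent n, 52-bit mantissa m)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp_field = (bits >> 52) & 0x7FF      # biased exponent; n = exp_field - 1023
    m = bits & ((1 << 52) - 1)
    return sign, exp_field - 1023, m

sign, n, m = decompose(5/6)
print(sign, n, m)   # 0 -1 3002399751580331

# Reconstruct the value as +/- (1 + m 2^-52) 2^n:
assert (1 + m * 2.0**-52) * 2.0**n == 5/6
```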

In the 5/6 example, the closest 64-bit float to 5/6 has m=3002399751580331 and n=-1. You can see this by computing the errors of nearby floats in higher precision. (I truncated digits from the output to make it easier to read.)

julia> abs((1 + 3002399751580330 * big(2.0)^-52) * 2.0^-1 - 5//6)

julia> abs((1 + 3002399751580331 * big(2.0)^-52) * 2.0^-1 - 5//6)

julia> abs((1 + 3002399751580332 * big(2.0)^-52) * 2.0^-1 - 5//6)

The middle number, with the smallest error, has 4 in the 16th digit when printed in decimal.

julia> (1 + 3002399751580331 * 2.0^-52) * 2.0^-1
0.8333333333333334

You can see where the 4 comes from by computing in higher precision.

julia> (1 + 3002399751580331 * big(2.0)^-52) * 2.0^-1
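A higher-precision check doesn't require BigFloat specifically; any exact type works. In Python, for instance, converting a float to a Fraction is exact, so you can confirm that the stored value differs from 5/6 by less than one float spacing:

```python
from fractions import Fraction

x = 5/6                      # nearest 64-bit float to 5/6
exact = Fraction(x)          # exact rational value of that float (no rounding)
print(exact == Fraction(5, 6))                            # False
# The spacing of floats in [1/2, 1) is 2^-53; the error is below that:
print(abs(exact - Fraction(5, 6)) < Fraction(1, 2**53))   # True
```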

If you’ve never noticed this behavior before, it might be because some languages hide the issue by not showing the last significant digit. Matlab, I’m looking at you!

>> 5/6                     % Matlab prints only 15 digits
ans =
   0.833333333333333

julia> 5/6                 # Julia prints all 16 significant digits
0.8333333333333334

More detail: 2.6 - 0.7 - 1.9

Here it’s more illuminating to take an abstract perspective. Let fl : \mathbb{R} \rightarrow F represent the function that rounds a real number x \in \mathbb{R} to the nearest floating-point number \tilde{x} \in F. The rounding operation obeys

\tilde{x} = fl(x) = x \: (1 + \epsilon) \text{ for some } \epsilon \text{ with } |\epsilon| < \epsilon_{m}

where \epsilon_m is the machine precision for the given floating-point number system. For 64-bit floats, \epsilon_m = 2^{-52}. Floating-point arithmetic requires rounding as well. For example, the difference between two floating-point numbers is generally in \mathbb{R} and not F. So computers implement floating-point operations that approximate operations over the reals, essentially by carrying out the operation over the reals and then rounding to the nearest floating-point value.

For example, the floating-point subtraction operation \ominus : F^2 \rightarrow F approximates - : \mathbb{R}^2 \rightarrow \mathbb{R} according to

x \ominus y = fl(x - y)
~~~~~~~~~= (x - y)(1 + \epsilon) \text{ for some } \epsilon \text{ with } |\epsilon| < \epsilon_{m}
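This exact-rounding guarantee is easy to test empirically. In Python, Fraction arithmetic is exact and float() rounds a rational to the nearest double, so fl(x - y) can be computed by hand and compared with the hardware subtraction:

```python
from fractions import Fraction

x, y = 2.6, 0.7
exact_diff = Fraction(x) - Fraction(y)   # x - y computed over the rationals
rounded = float(exact_diff)              # fl(x - y): round to nearest 64-bit float

# IEEE 754 guarantees the hardware result is exactly fl(x - y):
assert x - y == rounded
print(x - y)   # 1.9000000000000001
```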

Letting x, y, z = 2.6, 0.7, 1.9, the computer calculation of 2.6 - 0.7 - 1.9 is really

(\tilde{x} \ominus \tilde{y}) \ominus \tilde{z} = (x \: (1 + \epsilon_1) \ominus y \: (1 + \epsilon_2)) \ominus z \: (1 + \epsilon_3)

~~~~~~~~~~ = ( (x \: (1 + \epsilon_1) - y \: (1 + \epsilon_2) )(1 + \epsilon_4) - z \: (1 + \epsilon_3))( 1 + \epsilon_5)

for some \epsilon_i's bounded in magnitude by \epsilon_m. If you expand this and keep only first-order terms in the \epsilon_i's, you get

(\tilde{x} \ominus \tilde{y}) \ominus \tilde{z} = (x - y - z) + 3|x| \epsilon_6 + 3|y| \epsilon_7 + 2 |z| \epsilon_8

for some new \epsilon_i's bounded in magnitude by \epsilon_m. The first term (x - y - z) evaluates to zero, but the error terms are order-1 numbers times order \epsilon_m = 2^{-52} numbers. So (\tilde{x} \ominus \tilde{y}) \ominus \tilde{z} evaluates to something on the order of 2^{-52}, not zero.
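You can replay this chain of roundings explicitly and watch the result land exactly on 2^{-52}. A Python sketch, using float(Fraction) as the rounding function fl:

```python
from fractions import Fraction

fl = float   # float() rounds an exact rational to the nearest 64-bit float

x_t, y_t, z_t = 2.6, 0.7, 1.9                 # the already-rounded inputs
step1  = fl(Fraction(x_t) - Fraction(y_t))    # x~ (-) y~, exactly rounded
result = fl(Fraction(step1) - Fraction(z_t))  # (x~ (-) y~) (-) z~

assert result == 2.6 - 0.7 - 1.9 == 2.0**-52
print(result)   # 2.220446049250313e-16
```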

Lastly, note that Julia and similar languages don’t do this arithmetic themselves; they just pass it on to the host computer’s hardware and its floating-point instructions. Any language that parses 2.6 - 0.7 - 1.9 as (2.6 - 0.7) - 1.9 and computes it using the same floating-point hardware gives the same answer. On my Intel i7-3960x x86-64 CPU, all these languages give the same result

julia> 2.6 - 0.7 - 1.9  # julia 0.6.2
2.220446049250313e-16

>> 2.6 - 0.7 - 1.9      % matlab r2018a
ans =
   2.2204e-16

>>> 2.6 - 0.7 - 1.9     # python 2.7.13
2.220446049250313e-16

2.220446049250313e-16   # C compiled from printf("%1.16e\n", 2.6 - 0.7 - 1.9);

Nice writeup! Perhaps you could add a brief mention to prevfloat and nextfloat: it would be then less surprising for a user to see that a number x is approximated by \tilde{x} if they can see that x \in [\mathrm{prevfloat}(\tilde{x}), \mathrm{nextfloat}(\tilde{x})]. And explain that this doesn’t hold after performing operations, because rounding errors accumulate (well, one could use IntervalArithmetic.jl to check the bounds of the result).
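In Python terms the same check works with math.nextafter (Python 3.9+), which plays the role of Julia's prevfloat/nextfloat here:

```python
import math
from fractions import Fraction

x_tilde = 5/6                                  # the float approximating 5/6
prev = math.nextafter(x_tilde, -math.inf)      # analogue of prevfloat(x~)
nxt  = math.nextafter(x_tilde,  math.inf)      # analogue of nextfloat(x~)

# The true value lies strictly between the two neighboring floats,
# since x~ is within half a spacing of 5/6:
assert Fraction(prev) < Fraction(5, 6) < Fraction(nxt)
```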


I think, though, you should make it clear when decimal floating point is critically important (for example, monetary calculations; in fact, there are even laws in the EU about how currency calculations must be done).


For example, the floating-point subtraction operation \ominus : F^2 \rightarrow F approximates - : \mathbb{R}^2 \rightarrow \mathbb{R} according to x \ominus y = (x - y)(1 + \epsilon) \text{ for some } \epsilon \text{ with } |\epsilon| < \epsilon_{m}

This is true, but it would perhaps be more informative to say that x \ominus y = fl(x - y) (“exact rounding”) : it computes the result as if x-y were computed exactly and then rounded to the nearest floating-point value.

The weaker property that you give is often the one given in numerical-analysis textbooks (e.g. Trefethen), because it is the minimum that you need to prove most of the theorems, but I think that the stronger guarantee of exact rounding provided by IEEE 754 is a lot easier to understand.


Great writeup! Extending @giordano’s suggestion about prevfloat and nextfloat, you could also mention other functions from Base.Math, eg significand and exponent, and perhaps use them in the examples that decompose the floats.
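For comparison, Python's closest analogue is math.frexp, which likewise splits a float into significand and exponent, though it normalizes the significand into [0.5, 1) rather than [1, 2):

```python
import math

mant, e = math.frexp(10.0)
print(mant, e)              # 0.625 4, i.e. 10.0 == 0.625 * 2**4
assert mant * 2**e == 10.0
```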


Thanks for these suggestions. I’ll incorporate them as soon as I can, just editing the original post.

It’s so easy to illustrate how floats work when you can switch between and compare different floating-point types (especially BigFloat) and dive into details with all the floating-point introspection functions. It’s really helped my teaching. Thanks, Julia.


Somewhat naive question: There is this horrible horrible x86 extended precision. Does julia ever use this?

Is there anything I need to do to be safe from this? I have never seen Julia emit the 80-bit stuff; I guess it is never used, but I’d be happy about reassurance.

Context: I have a pure inlined float function; I need it to reliably produce identical outputs, regardless of context where it is inlined in (and I already make sure that I never use it from a @fastmath context).

No, on x86 we only use SSE instructions (80-bit floats are only available with x87 instructions).

Note there are still some ways you might get non-deterministic results:

  • BLAS or LAPACK operations
  • @fastmath: which lets the compiler do a lot of manipulations
  • @simd: which allows re-association of arithmetic operations to exploit SIMD operations.
  • muladd: allows use of either a*b+c or fma(a,b,c) depending on which is faster. This is a tricky one, as we’re increasingly making use of it (such as the recent libm work).
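The re-association point in particular is easy to demonstrate: floating-point addition is not associative, so any optimization that regroups a sum can change the result. In Python, for instance:

```python
# Regrouping a floating-point sum changes which intermediate roundings occur.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b)      # 0.6000000000000001 0.6
assert a != b
```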

Thanks! I feel suitably reassured now.

A post was split to a new topic: Fixed point decimals for accounting

I wouldn’t call it horrible. The article describes a bug in PHP, not in x87. As far as I’m concerned, 80 bits was a good idea, because it’s better to have more bits. Of course, you need the language to support them correctly.

By the way, the great floating point expert William Kahan was part of a proposal to have Java support extended precision, although I believe they were ultimately unsuccessful.

The point is somewhat moot today. As indicated by @simonbyrne, x87 was superseded by SSE. But there’s nothing wrong with 80 bits.


I think for Java they did successfully add support for 80-bit floats?! Anyway, for Julia, I was pretty sure there was a package adding such support, but I can’t track it down. It should be the fastest option on Intel-compatible hardware for that much (or more) precision (while not working on e.g. ARM), though still slower than Float64, which can be SIMD-enabled. I would really look into other options, because even if I had found that package, it would be platform-dependent (or would possibly fall back to Float64, in which case the precision could not be relied upon), unlike other, better options:

This might be the fastest high-precision package with more than 80 bits, and it’s cross-platform (it would be a bit slower than 80-bit, or possibly not, since I guess it could be SIMD-enabled?):

Two new keywords were added to JDK 1.2 with the beta 4 release: strictfp and widefp . These keywords let you specify the IEEE 754-standard strictness that a method or class uses when calculating intermediate results. Currently, these options are not used by the JVM, but you can place them in your code. The strictfp keyword acts like the current float behavior, while widefp is an extended-precision format, which may be faster.

I’m not sure what to read into this “fixed” issue (i.e., is “Currently, these options are not used by the JVM” above outdated?):
compiler incorrectly accepts strictfp/widefp constructor

I would really consider other options than more precision, such as interval packages, or even posits (valids), if you’re adventurous:


Maybe you were thinking of BitFloats.jl ?


Yes, from memory that’s likely it (is there any other implementation?).

Thanks for making this (it’s non-trivial), even though using 80-bit floats is now outdated (I see Java is retiring them), as it arguably was already at the time you made it…

Looking at the source code, I see UInt80 used (non-exported) and LLVM bitcast bitcode; from the generated code with @code_native, I’m not sure this is as optimized as it could be, e.g. as C/C++ compilers did historically (they would need to use the, by now outdated, x87 float stack, but not convert from integers, since they handled the type as fully native).

FYI: Java no longer supports “slow old x87 FPU” as of at least Java 15.
“The non-strict environment accommodates certain peculiarities of performing 32-bit float and 64-bit double calculations on the 80-bit registers of the x87 stack”

The proposed specification changes are low-risk, mostly deleting text and updating the floating-point overviews in JLS and JVMS. The platform’s original semantics are restored and strict-only was always an allowable implementation option.

I was a bit surprised by: “We’re sorry the site has closed.” Java is of course still supported: “Java SE 16.0.1 is the latest release for the Java SE Platform”, and Java 17 is now in its rampdown phase…