# PSA: floating-point arithmetic

#1

Sometimes people are surprised by the results of floating-point calculations such as

julia> 5/6
0.8333333333333334       # shouldn't the last digit be 3?

julia> 2.6 - 0.7 - 1.9
2.220446049250313e-16    # shouldn't the answer be 0?


These are not bugs in Julia. They’re consequences of the IEEE-standard 64-bit binary representation of floating-point numbers that is burned into computer hardware, which Julia and many other languages use by default.

### Brief explanation

You can think of 64-bit floating point numbers like binary scientific notation with 52 bits for the mantissa, 11 bits for the exponent, and 1 bit for the sign. The 52-bit mantissa means that floating-point numbers between 1 and 2 are spaced 2^{-52} \approx 10^{-16} apart. When you type in a decimal number or perform an arithmetic operation, in general the result must be rounded to the nearest floating-point number.

The 5/6 example above has a trailing 4 because the 64-bit float that is closest to 5/6 prints as 0.8333333333333334 when converted back to decimal and printed to all 16 significant digits.

In the 2.6 - 0.7 - 1.9 example, none of those three decimal numbers has an exact 64-bit binary representation, so each must be rounded to the nearest 64-bit float before doing any arithmetic. Further, since the 64-bit floating-point number system is not closed under arithmetic operations, the results of arithmetic operations generally usually must be rounded to the nearest 64-bit float as well. All these rounding operations produce a result that is close to but not exactly zero. The strange-looking number that should be zero is actually 2^{-52} -the spacing between floating-point numbers in the range from 1 to 2.

julia> 2.6 - 0.7 - 1.9
2.220446049250313e-16

julia> 2.0^-52
2.220446049250313e-16


It might seem you could fix these problems by switching to a base-10 floating-point number system. This would eliminate initial rounding errors when converting from decimal, but the fix would be totally superficial. Most arithmetic operations occur on the results from previous calculations. Any exactness due to lack of initial rounding error would be lost after just a few calculations. And the cost doing floating-point arithmetic in decimal, in hardware or software, is big.

The take-home message is that any finite computer representation of the real number system has finite precision, so wanting exact answers is unrealistic. For 64-bit floating-point numbers, you cannot expect conversions from decimal or arithmetic operations to be accurate to more than 16 digits.

### More detail: 5/6

Still glossing over a lot of subtlety (overflow./underflow, subnormals, etc.), a 64-bit float is a number of the form

\pm (1 + m \: 2^{-52}) \: 2^n

where m is an integer between 0 and 2^{52}-1 (occupying 52 bits), n is an integer between -2^{10} and 2^{10}-1 (occupying 11 bits), and the remaining 1 bit is used for the sign.

In the 5/6 example, the closest 64-bit float to 5/6 has m=3002399751580331 and n=-1. You can see this by computing the errors of nearby floats in higher precision. (I truncated digits from the output to make it easier to read.)

julia> abs((1 + 3002399751580330 * big(2.0)^-52) * 2.0^-1 - 5//6)
7.401486830834377...e-17

julia> abs((1 + 3002399751580331 * big(2.0)^-52) * 2.0^-1 - 5//6)
3.700743415417188...e-17

julia> abs((1 + 3002399751580332 * big(2.0)^-52) * 2.0^-1 - 5//6)
1.480297366166875...e-16


The middle number, with the smallest error, has 4 in the 16th digit when printed in decimal.

julia> (1 + 3002399751580331 * 2.0^-52) * 2.0^-1
0.8333333333333334


You can see where the 4 comes from by computing in higher precision.

julia> (1 + 3002399751580331 * big(2.0)^-52) * 2.0^-1
8.3333333333333337034076748750521801412105560302734375...e-01


If you’ve never noticed this behavior before, it might be because some languages hide the issue by not showing the last significant digit. Matlab, I’m looking at you!

>> 5/6                     # Matlab prints only 15 digits
0.833333333333333

julia> 5/6                 # Julia prints all 16 significant digits
0.8333333333333334


### More detail: 2.6 - 0.7 - 1.9

Here’s it’s more illuminating to take an abstract perspective. Let fl : \mathbb{R} \rightarrow F represent the function that rounds a real number x \in \mathbb{R} to the nearest floating-point number \tilde{x} \in F. The rounding operation obeys

\tilde{x} = fl(x) = x \: (1 + \epsilon) \text{ for some } \epsilon \text{ with } |\epsilon| < \epsilon_{m}

where \epsilon_m is the machine precision for the given floating-point number system. For 64-bit floats, \epsilon_m = 2^{-52}. Floating-point arithmetic requires rounding as well. For example, the difference between two floating-point numbers is generally in \mathbb{R} and not F. So computers implement floating-point operations that approximate operations over the reals, essentially by carrying out the operation over the reals and then rounding to the nearest floating-point value.

For example, the floating-point subtraction operation \ominus : F^2 \rightarrow F approximates - : \mathbb{R}^2 \rightarrow \mathbb{R} according to

x \ominus y = fl(x - y)
~~~~~~~~~= (x - y)(1 + \epsilon) \text{ for some } \epsilon \text{ with } |\epsilon| < \epsilon_{m}

Letting x,y, z= 2.6, 0.7, 1.9, the computer calculation of 2.6 - 0.7 - 0.9 is really

(\tilde{x} \ominus \tilde{y}) \ominus \tilde{z} = (x \: (1 + \epsilon_1) \ominus y \: (1 + \epsilon_2)) \ominus z \: (1 + \epsilon_3)

~~~~~~~~~~ = ( (x \: (1 + \epsilon_1) - y \: (1 + \epsilon_2) )(1 + \epsilon_4) - z \: (1 + \epsilon_3))( 1 + \epsilon_5)

for some \epsilon_i's bounded in magnitide by \epsilon_m. If you expand this and keep only first-order terms in the \epsilon_i's, you get

(\tilde{x} \ominus \tilde{y}) \ominus \tilde{z} = (x - y - z) + 3|x| \epsilon_6 + 3|y| \epsilon_7 + 2 |z| \epsilon_8

for some new \epsilon_i's bounded in magnitude by \epsilon_m. The first term (x - y - z) evaluates to zero, but the error terms are order-1 numbers times order \epsilon_m = 2^{-52} numbers. So (\tilde{x} \ominus \tilde{y}) \ominus \tilde{z} evaluates to something on the order of 2^{-52}, not zero.

Lastly, note that Julia and similar languages don’t do this arithmetic themselves, they just pass it on to the host computer’s hardware and its floating-point instructions. Any language that parses 2.6 - 0.7 - 1.9 as (2.6 - 0.7) - 1.9 and computes it using the same floating-point hardware gives the same answer. On my Intel i7-3960x x86-64 CPU, all these languages give the same result

julia> 2.6 - 0.7 - 1.9  # julia 0.6.2
2.220446049250313e-16

>> 2.6 - 0.7 - 1.9      # matlab r2018a
ans =
2.220446049250313e-16

>>> 2.6 - 0.7 - 1.9     # python 2.7.13
2.220446049250313e-16

2.220446049250313e-16   # C compiled from printf("%1.16e\n", 2.6 - 0.7 - 1.9);


A_mul_B! deviates from C = A*B with complex numbers
Inverse of a symmetric matrix is not symmetric?
Float comparison error
1 * X' * X is not symmetric
Different results for cholesky decomposition between 0.6.2 and 0.6.3
#2

Nice writeup! Perhaps you could add a brief mention to prevfloat and nextfloat: it would be then less surprising for a user to see that a number x is approximated by \tilde{x} if they can see that x \in [\mathrm{prevfloat}(\tilde{x}), \mathrm{nextfloat}(\tilde{x})]. And explain that this doesn’t hold after performing operations, because rounding errors accumulate (well, one could use IntervalArithmetic.jl to check the bounds of the result).

#3

I think though you should make it clear when decimal floating point is critically important (i.e. for example with monetary calculations - in fact, there are even laws in the EU about how currency calculations should be done).

#4

For example, the floating-point subtraction operation \ominus : F^2 \rightarrow F approximates - : \mathbb{R}^2 \rightarrow \mathbb{R} according to x \ominus y = (x - y)(1 + \epsilon) \text{ for some } \epsilon \text{ with } |\epsilon| < \epsilon_{m}

This is true, but it would perhaps be more informative to say that x \ominus y = fl(x - y) (“exact rounding”) : it computes the result as if x-y were computed exactly and then rounded to the nearest floating-point value.

The weaker property that you give is often the one given in numerical-analysis textbooks (e.g. Trefethen), because it is the minimum that you need to prove most of the theorems, but I think that the stronger guarantee of exact rounding provided by IEEE 754 is a lot easier to understand.

#5

Great writeup! Extending @giordano’s suggestion about prevfloat and nextfloat, you could also mention other functions from Base.Math, eg significand and exponent, and perhaps use them in the examples that decompose the floats.

Wrong output from det(Array{Complex})
#6

Thanks for these suggestions. I’ll incorporate them as soon as I can, just editing the original post.

It’s so easy to illustrate how floats work when you can so easily swtich and compare different floating-point types (especially BigFloat) and dive into details with all the floating-point introspection functions. It’s really helped my teaching. Thanks, Julia.

#7

Somewhat naive question: There is this horrible horrible x86 extended precision. Does julia ever use this?

See e.g. http://www.exploringbinary.com/php-hangs-on-numeric-value-2-2250738585072011e-308/, http://www.exploringbinary.com/why-volatile-fixes-the-2-2250738585072011e-308-bug/. Is there anything I need to do to be safe from this? I have never seen julia emit the 80 bit stuff; I guess it is never used, but I’d be happy about reassurance.

Context: I have a pure inlined float function; I need it to reliably produce identical outputs, regardless of context where it is inlined in (and I already make sure that I never use it from a@fastmath context).

typeof(Int32(1)/Int32(2)) == Float64 @_@
#8

No, on x86 we only use SSE instructions (80-bit floats are only available with x87 instructions).

Note there are still some ways you might get non-deterministic results:

• BLAS or LAPACK operations
• @fastmath: which lets the compiler do a lot of manipulations
• @simd: which allows re-association of arithmetic operations to exploit SIMD operations.
• muladd: allows use of either a*b+c or fma(a,b,c) depending on which is faster. This is a tricky one, as we’re increasingly making use of it (such as the recent libm work).

#9

Thanks! I feel suitably reassured now.