Sum of float64 vector gives slightly incorrect answer

I’ve noticed that when I run the following code

a = [0.00694 0.0023 -0.02064 0.01405]

but instead of the expected answer of 0.00265 I get 0.0026500000000000013
Normally this wouldn’t be a problem but this is happening a lot in my program and it changes the results by significant decimal points

Most decimal numbers cannot be represented exactly by a base 2 floating point number. Please read the following in the manual:

If you need to add decimal numbers exactly, you will need to represent your numbers differently. Eg with the DecFP package.

1 Like

Note also that this happens in every language, since every popular language uses the hardware (binary) floating-point arithmetic by default. You might not notice it in some languages, e.g. Matlab, simply because they print fewer digits by default.

Well, ANSI M/MUMPS (used by a large percentage of the major healthcare applications, especially in the US), uses decimal floating point, so things like a for loop can be done like this: For i = 0:0.3:0.1 { Write !,i } (the step is optional, and is last, not in the middle like Julia) would print "\n0\n0.1\n0.2\n0.3".
No worrying about string <-> floating point conversions.

As suggested above, a way to get higher accuracy is to use higher precision arithmetic for intermediate calculations.

For example, you can parse as BigFloat the string representations of the numbers, sum the BigFloats, then convert back to Float64:

julia> a = [0.00694 0.0023 -0.02064 0.01405]
1×4 Array{Float64,2}:
 0.00694  0.0023  -0.02064  0.01405

julia> Float64(sum(parse.(BigFloat, string.(a))))

This is far from being efficient, but that’s the cost for higher accuracy.

1 Like

Thanks for the responses. This inaccuracy only occurs every few iterations of my program, and as everything depends on previous iterations, the results after about 1000 iterations are very different from what they used to be when the vector was an Array{Float64, 2}

Maybe if I explain what I’m doing, there might be a simple fix. I have a matrix of asset returns with each row holding one days returns, so I use mu = transpose(mapslices(mean, exp_tup, [1])) to get the column means. This returns an Array{Float64,2}. I only got the floating point inaccuracy when converting the array to a vector using the code mu = transpose(mapslices(mean, exp_tup, [1]))[1:end] which I needed when using the mu in the quadprog function. However, I’ve changed how I go about this conversion and now only convert when calling the function like so: OptimSemiLog(mu[:,1], sigma, risk_aver_vec[L]). I haven’t checked if this results in the floating point inaccuracy but I noticed before it was only occuring when the number of rows = 4

OptimSemiLog is my function which just calls the QuadProg function

function OptimSemiLog(mu, Sigma, risk_aver)
    A = transpose(ones(size(mu)))
    sol = quadprog(mu, Sigma*risk_aver, A, '=', 1, 0, 1, IpoptSolver(print_level=0))
    return sol.sol

If I just left mu, I got a type error.

A further question, how would I go about detecting where these floating point inaccuracies occur? I only picked up on this happening by chance

Although it is in a package,, it will give you the results you expect, and is a good deal faster than BigFloat also. (It seemed about 3x faster overall, when I was benchmarking it a couple of years ago, and memory usage is much better).

Basically everywhere you use floating point, unless both the arguments and the result are exactly representable, which happens mostly by chance. It is best to just assume that this will happen, then deal with it in some appropriate manner. In this case, you could

  1. just print fewer digits (if it bothers the user),
  2. use isapprox to compare results,
  3. round, renormalize, etc.

It all depends on your problem.

Note that decimal floating-point arithmetic only eliminates roundoff errors for converting human decimal inputs into the computer’s representations. Arithmetic operations can still easily incur roundoff errors if the result is not exactly represented in decimal with the current precision.

The poster is apparently calling a quadratic-programming solver, which will almost certainly incur roundoff errors in both binary and decimal floating-point arithmetic. Also, for calling Ipopt you’ll need binary floating-point values anyway.

The key point here is that roundoff errors are inherent to floating-point arithmetic. This will happen in general regardless of whether binary or decimal is used, and regardless of the precision (Float64 or Dec128 or BigFloat). You need to get used to this and comfortable with its properties if you are going to do floating-point arithmetic (which is typically the only practical way to do large-scale computation) in any language.


Some math stuff I wasn’t aware of!

Does that function work with ArbFloat?
I think those are faster than BigFloat, and can be more accurate?

Just to be clear, I was referring to the fact that Ipopt is an external C++ library, and like most numerical C/C++ libraries (even in C++) I’m pretty sure it’s hard-coded only to use the hardware floating-point types (and is compiled in double precision by default). You can’t pass it any other floating-point type.


Precisely why we need more of those libraries rewritten in nice generic Julia!
Thanks for the clarifications!


Why isn’t DecFP ( IEEE 754-2008 ) used as the base/regular way to represent numbers in Julia?

I’m probably one of the people who most wanted/needed decimal floating point support (for databases) when I first started using Julia (and at-stevengj almost instantly wrapped the Intel library), but since most of the Julia users are looking for scientific/numerical/technical computing, I don’t think that would make sense.
(Having DecFP.jl in stdlib, but not loaded by default, I think might be a good idea, though)

Because TwoFP is the common case? With floating point in general you create a nonlinear system and DecFP only looks like more natural.

The nice thing about Julia is that it supports a wide variety of numeric types, ArbFloats, BigFloats, IEEE floats (both binary and decimal), Unums, and allows you to use them with other generic numeric types, such as Complex.
What other language allows you all that? :grinning:

1 Like

Because decimal floating-point is implemented only in software (on most CPUs), whereas binary floating-point is implemented in hardware and therefore inevitably much faster, no matter how cleverly you implement the decimal arithmetic.

And again, decimal floating point only eliminates one kind of roundoff error: roundoff in converting human (decimal) inputs into the computer’s type. This is extremely useful in some applications, e.g. finance! But the vast majority of technical computing performs additional calculations on the inputs, where you will almost inevitably incur roundoff errors even in decimal floating-point. For things like scientific modeling, statistics, and machine learning, there is essentially no benefit to decimal floating-point, and so performance considerations dominate.

It is quite easy to use decimal floating-point in your Julia code if you want it (the ChangePrecision package makes it even easier, for simple scripts), and for some applications it is crucial. But it is a mistake to view it as a panacea. If you do computing with non-integer values, you need to learn something about roundoff errors and computer arithmetic.


Seems like it’s time for a “PSA: floating-point arithmetic”, like Thomas_Papp’s “PSA: How to quote code with backticks”. I’ll write one unless someone beats me to it.


Does that work for decimal floating-point correctly? I remember a problem I had when I started with Julia, I found out that the femto-lisp parser converts floating point literals to Float64, and doesn’t keep the original string. It would be good if Meta.parse (possibly only with a keyword argument) would wrap literals (at least those that were not “canonic”) with an Expr that has the converted value along with the original string parsed.

Mostly yes.

ChangePrecision works by taking float literals (which are indeed stored in the AST as Float64) and printing them to a string, then re-parsing this string in the desired numeric type. In almost all cases, the grisu algorithm used to print floating-point values will print something equivalent to the original decimal input string, since it prints the shortest string that parses to the same value.

For example, 0.3 parses to a Float64 value that is not quite 0.3, but it prints as "0.3", so ChangePrecision will preserve this exact value when converting to Dec64, or for BigFloat it will give you BigFloat("0.3"). However, it is not 100% reliable — in the unlikely event that you enter the literal 0.099999999999999999, ChangePrecision will treat it as if you had typed "0.1" (== repr(0.099999999999999999)).

ChangePrecision is convenient for quick hacks, I think — taking an existing script and quickly experimenting with another precision. But in the long run, you should strive to write libraries of code that work for any precision (indeed, any Number type), and if you need absolute guarantees of representing human inputs exactly you should probably use DecFP (or similar) directly.