Question about floating-point precision in summation

Hello everyone, I have a “technical” (probably trivial) question.

My code computes many matrices (calling an external C library) in parallel across multiple processes, then finally adds these matrices together to obtain the final result.

Currently, to avoid race conditions, all the matrices are stored in a multidimensional array (a series of matrices stacked “one on top of the other”) shared between the processes, and each process works on a different matrix.

Each process (i.e. each worker) computes a different matrix and accumulates it into the multidimensional array described above.

All good, were it not for the fact that some of the computed numbers (i.e. some matrix elements) are very small (around 10^(-30)), and this causes a considerable loss of precision in the final result if the multidimensional array is declared as Float64.

I’m sure the problem is precision because, using a single process with multithreading, I can declare the multidimensional array as BigFloat, and the final result is then perfect simply because the precision is much higher.

The problem is that Julia does not allow declaring shared arrays of type BigFloat… Unfortunately I cannot use threads because (for some reason) the allocated memory grows enormously, while with processes this does not happen (the code is otherwise identical, and I still don’t understand why).

What can I do? That is, is there a way to use shared arrays while maintaining sufficiently high precision in the sums?

Thank you so much and have a nice day everyone!


You could try Kahan summation, e.g. with KahanSummation.jl.
That would be something like sum_kbn.(matrices) to broadcast over each element.
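For reference, the idea behind compensated (Kahan) summation fits in a few lines of plain Julia. This is a minimal illustrative sketch, not the package’s actual implementation (KahanSummation.jl’s sum_kbn uses a refined variant of the same trick):

```julia
# Minimal Kahan compensated summation (illustration only).
function kahan_sum(xs)
    s = 0.0   # running sum
    c = 0.0   # compensation: low-order bits lost from s
    for x in xs
        y = x - c
        t = s + y
        c = (t - s) - y   # recover the rounding error of s + y
        s = t
    end
    return s
end

xs = vcat(1.0, fill(1e-16, 10))
sum_naive = foldl(+, xs)   # plain left-to-right sum: every 1e-16 vanishes
sum_comp  = kahan_sum(xs)  # keeps the accumulated ~1e-15 contribution
```

With a plain loop, each 1e-16 term is smaller than half an ulp of 1.0 and disappears; the compensated sum recovers them.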

But if each worker only works on a single matrix and never touches the matrices another worker is using, I am not sure that you need a SharedArray.
Maybe you can just send them all back to the main process at the end?

What do you mean exactly? Sorry, but I have just started using Julia.

Multiprocessing is tricky and context-dependent - it would help if you could include a boiled-down working example of your code to demonstrate exactly what you’re trying to do.

If you require BigFloat precision in order to get acceptable accuracy, it could be that you are using a numerically unstable algorithm, and you’d be better off with some other algorithm. Impossible to say without more information about what you are doing, though.


I have just seen that there are some packages available, such as ParallelDataTransfer.jl or ParallelOperations.jl.

If I am lucky, I will be able to create a BigFloat matrix in each process and accumulate separately, then transmit the values back to the main process.
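Such a per-process accumulation can be sketched using only the Distributed standard library (remotecall/fetch), without any extra packages. This is a hedged sketch: local_accumulate, the matrix size, and the per-matrix computation are placeholders for the real code:

```julia
using Distributed
addprocs(2)   # assumption: two worker processes suffice for the sketch

@everywhere function local_accumulate(ids)
    # Each worker keeps its own local BigFloat accumulator; no shared array needed.
    acc = zeros(BigFloat, 3, 3)
    for i in ids
        # Placeholder for the real per-matrix computation
        acc .+= ones(3, 3) .* (big(10.0)^-30 * i)
    end
    return acc
end

# Split the work, run it remotely, fetch the BigFloat partial sums back
chunks  = [1:50, 51:100]
futures = [remotecall(local_accumulate, w, c) for (w, c) in zip(workers(), chunks)]
total   = sum(fetch.(futures))   # each entry ≈ (1 + 2 + … + 100) · 1e-30
```

BigFloat matrices serialize fine over Distributed, so only the final transfer back to the main process is needed.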

I hope this will work. Thank you very much for trying to help me and I wish everyone a good day!

One thing to try is a library like DoubleFloats.jl, which provides a type that is much more accurate than Float64 but much faster than BigFloat.


Could you please give us more details about what you actually compute? As stated in previous comments, there may be several solutions to this kind of problem:

  • using larger floats (e.g. DoubleFloats or BigFloats)
  • using compensated summation algorithms (e.g. Kahan-Babuska-Neumaier or Ogita-Rump-Oishi, as implemented in KahanSummation.jl and/or AccurateArithmetic.jl)
  • changing your overall algorithm to be more stable

The “right” approach depends on your problem:

  • What do you want to compute exactly?
  • Are you summing only positive numbers?
  • If not (i.e. if the numbers can have different signs), what are the condition numbers of your sums? \(\displaystyle C = \frac{\sum_i \lvert x_i\rvert}{\left\lvert\sum_i x_i\right\rvert}\)
  • How many elements are there in each sum (i.e. in your case, how many matrices are you summing together)?
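As a concrete illustration of the condition number above (the numbers here are made up for the example), it can be estimated in Julia with a BigFloat reference sum:

```julia
# Condition number of a sum, C = Σ|xᵢ| / |Σxᵢ|, using BigFloat for the reference.
xs = [1e10, 1.0, -1e10]          # massive cancellation: the exact sum is 1.0
C  = sum(abs, xs) / abs(Float64(sum(big.(xs))))
# C ≈ 2e10: roughly 10 of Float64's ~16 significant digits are lost to cancellation
```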

For example, I can parallelise operations with pmap:

sum(pmap(f, xs))

or with @distributed:

@distributed (+) for x in xs
    f(x)
end

both of which are in the Distributed standard library.
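A self-contained sketch of both variants (hedged: f, the matrix size, and the ranges are placeholders for the real computation):

```julia
using Distributed
addprocs(2)   # assumption: two workers for the sketch

@everywhere f(i) = ones(2, 2) .* i   # placeholder for the real matrix computation

# Variant 1: compute on workers with pmap, then sum on the main process
A = sum(pmap(f, 1:10))

# Variant 2: distributed reduction, with (+) as the reducer
B = @distributed (+) for i in 1:10
        f(i)
    end
```

Both produce the same result; the @distributed variant reduces partial sums on the workers before combining them on the main process.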

Note that Xsum.jl is actually faster than Float64 compensated summation and is much more accurate — it is equivalent to summing in infinite precision and then rounding the exact sum to Float64 at the end. By the same token, there’s no need to bother with BigFloat or DoubleFloat etcetera just to do sums accurately; Xsum will be better.

But I continue to think that if you find yourself needing arbitrary precision in a practical computational setting there is a high probability that you are making a mistake in how you formulated your algorithm. (Or you are simply misunderstanding accuracy, i.e. thinking that a tiny roundoff error is more important than it actually is.)


Or solving a problem that is inherently ill-conditioned, in which case going to higher precision makes sense. A minimal working example would really help clarify what the root problem is.

In the original problem formulation, does “getting the final result” mean solving some linear algebra problem using the computed matrix? What is the condition number of the matrix?

Xsum is indeed much more accurate than any compensated summation. It is also impressively fast (especially considering the gain in accuracy), however not faster than compensated algorithms in my experience. At least not when the numbers to be summed are nicely aligned in a vector: the results shown in AccurateArithmetic.jl’s Readme illustrate how fast compensated algorithms can be when it is possible to use a SIMD-vectorized implementation. (Which may not be the case here)

I also mentioned a comparison of the performance of Base.sum, AccurateArithmetic.sum_oro and Xsum.xsum in my talk at JuliaCon last summer:

I entirely agree on this. The best way to solve such issues is to make them disappear entirely when possible. This may involve changes in the mathematical formulation of the problem, or changes in the resolution algorithm itself.


Usually, in a practical setting, if your problem is inherently ill-conditioned then you are solving the wrong problem — ill-conditioning means that the outputs vary wildly with tiny fractional changes in the input, which often means that the answers are useless in the face of uncertainties and you need to re-think things (e.g. by using some regularization).

It’s more common to see situations where the overall problem you are trying to solve is well-conditioned, but an intermediate step of your algorithm is ill-conditioned, or well-conditioned but numerically unstable, in which cases you can typically avoid the problem by changing your algorithm.


Good point, thanks; I was comparing to the non-SIMD algorithm in KahanSummation.jl.


Edit: I see people arguing that the algorithm may be wrong in the event that arbitrary precision is required.

No guys, the algorithm is right, and it is formulated in the most sensible way my collaborators and I could come up with. I didn’t post it here because it would be too involved to explain, and without knowledge of my research field you wouldn’t even understand what the algorithm means.
If arbitrary precision were not useful (especially when you do many operations on very small numbers, as in my case), it wouldn’t have been invented, and so many packages wouldn’t have been developed.

Thanks to everyone who commented and gave me advice, but I have already solved it by eliminating the need for a shared array and working with BigFloat multidimensional arrays on separate processes. This way, the final result is very precise.

I leave the topic open because interesting discussions have arisen and that could potentially help someone else.

Thanks again to everyone and I wish you a great day.


The talk was very interesting. Your function dot_oro might be very useful for me.


Note that you probably won’t be able to use dot_oro immediately, since (if I understand your code correctly), you want to add matrices together rather than scalars, so that the elements you want to sum are not nicely stored together in a vector.

As @stevengj mentioned, you might want to use Xsum in this case, since AFAIK it works with any iterable collection. So that would probably require very few modifications to your current code.

Another technique could be to perform a parallel reduction like @oxinabox suggested above. A MWE of a compensated+distributed reduction re-using AccurateArithmetic.jl's internals could look like:

using Distributed

@everywhere begin
    using AccurateArithmetic: two_sum

    # This type will be used to implement compensated summation
    struct Accumulator
        high :: Float64   # First-order term in the result
        low  :: Float64   # Error term
    end

    # Various methods needed for the parallel reduction
    function add(a::Number, b::Number)
        hi, lo = two_sum(a, b)
        Accumulator(hi, lo)
    end

    function add(a::Accumulator, b::Number)
        hi, lo = two_sum(a.high, b)
        Accumulator(hi, a.low + lo)
    end

    function add(a::Accumulator, b::Accumulator)
        hi, lo = two_sum(a.high, b.high)
        Accumulator(hi, a.low + b.low + lo)
    end

    add(a::AbstractArray, b::AbstractArray) = add.(a, b)

    # In the end, reduce an `Accumulator` to a single `Float64`
    # by summing both components
    reduce(a::Accumulator) = a.high + a.low
end
With that in place, the computation itself could be as simple as

# Parallel reduction using the custom `add` function
julia> acc = @distributed add for i in 1:100
           ones(3,3) * sin(0.02pi*i)
       end
3×3 Matrix{Accumulator}:
 Accumulator(-4.88498e-15, 5.63682e-15)  Accumulator(-4.88498e-15, 5.63682e-15)  Accumulator(-4.88498e-15, 5.63682e-15)
 Accumulator(-4.88498e-15, 5.63682e-15)  Accumulator(-4.88498e-15, 5.63682e-15)  Accumulator(-4.88498e-15, 5.63682e-15)
 Accumulator(-4.88498e-15, 5.63682e-15)  Accumulator(-4.88498e-15, 5.63682e-15)  Accumulator(-4.88498e-15, 5.63682e-15)

# `acc` computed above is an array of `Accumulator`s
# -> reduce it to an array of `Float64`
julia> reduce.(acc)
3×3 Matrix{Float64}:
 7.51836e-16  7.51836e-16  7.51836e-16
 7.51836e-16  7.51836e-16  7.51836e-16
 7.51836e-16  7.51836e-16  7.51836e-16

Note that for this to work with Julia v1.6, you’ll probably have to use a version of AccurateArithmetic that incorporates PR#24 by @Elrod (which I haven’t merged yet; sorry about that!):

using Pkg
pkg"add AccurateArithmetic#a39fd8696e"

But again, and at the risk of being too heavy-handed here, I would like to second @stevengj’s advice above. I obviously find arbitrary/increased precision useful enough to work on it myself. However, please consider whether the ill-conditioning of your problem really comes from the (physical?) system you’re trying to simulate, or from one of the mathematical modeling steps that lead to the solution algorithm you’re implementing. If it is the latter, that does not mean you or your colleagues did a bad job devising an algorithm that suits your needs. But you might want to take it as an indication that putting additional work into the mathematical modeling and/or the algorithm could be more rewarding than focusing on an accurate implementation of the algorithm.

In my experience, any improvement made in the early stages of the whole simulation process (i.e. in the mathematical modelling or algorithm stages) is likely to have a more beneficial impact than a change in the implementation itself.

That being said, I’m fully aware that investigating such things takes time. Time that you might want to devote to other things (and rightfully so).


Note that Xsum will be both faster and more accurate than using big floats.


Amazing! That’s a lot of useful information, thank you. As I said, there is no way (in my computation) to avoid adding many terms that differ by many orders of magnitude, but I’ll keep your advice in mind for future projects.

BigFloats are useful in some contexts, but that does not mean that they are a panacea. As you have discovered, they come with their own trade-offs.

It is unclear what “very small” means in your context, and whether that is actually the source of the error. E.g. Float64 has an 11-bit exponent and can go down to around 10^{-308} without subnormals, so if scale were the only problem, simply rescaling could work.
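For instance (a hedged illustration in plain Julia), magnitude alone costs no precision in Float64: the relative spacing of floats is the same at any normal scale, and precision is only lost when terms of very different magnitudes are mixed in a sum:

```julia
# Relative precision of Float64 is scale-independent for normal numbers
@assert eps(1.0e-30) / 1.0e-30 < 2.3e-16   # still ~16 significant digits at 1e-30
@assert 1.0e-30 + 1.0e-45 != 1.0e-30       # small relative increments still register

# Precision is lost only when magnitudes are mixed
@assert 1.0 + 1.0e-30 == 1.0               # the tiny term vanishes entirely
```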