Julia equivalent of Python's "fsum" for floating point summation

Radford’s C code has now been posted under an MIT license: gitlab.com/radfordneal/xsum

15 Likes

Radford’s C code is designed for summing C doubles. We could create an interface to the C code, and that would work for summing many Float64s accurately. Or we could reimplement it in Julia and allow for summing T<:AbstractFloat. The maximum number of floats summable by the small superaccumulator will vary with the precision of the values being summed.
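For the wrapping route, a thin ccall layer would be enough. A minimal sketch, assuming the C library is built as a shared libxsum and keeps the xsum_small_init / xsum_small_addv / xsum_small_round entry points from xsum.h (the accumulator is treated here as an opaque, over-allocated byte buffer):

const libxsum = "libxsum"  # assumed path/soname of the compiled shared library

function xsum_small(v::Vector{Float64})
    acc = Vector{UInt8}(undef, 1024)  # over-allocated for the opaque xsum_small_accumulator
    ccall((:xsum_small_init, libxsum), Cvoid, (Ptr{UInt8},), acc)
    ccall((:xsum_small_addv, libxsum), Cvoid,
          (Ptr{UInt8}, Ptr{Cdouble}, Cint), acc, v, length(v))  # xsum_length assumed to be a C int
    ccall((:xsum_small_round, libxsum), Cdouble, (Ptr{UInt8},), acc)
end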

2 Likes

Also, in a Julia implementation the size of the accumulator could be a type parameter, making it possible to use less memory when the magnitude of the sum is known to be within a given range.
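For instance (an illustrative sketch only, not an existing API), the chunk count could be encoded in the type:

# Sketch: N chunks as a type parameter, so callers who know their exponent
# range can pick a smaller, stack-allocatable accumulator.
struct SmallSuperAcc{N}
    chunks::NTuple{N,Int64}
end
SmallSuperAcc{N}() where {N} = SmallSuperAcc{N}(ntuple(_ -> zero(Int64), N))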

Having read Radford’s paper, I’m worried that the accumulator will completely fill the L1 cache and therefore be slower in practice than the benchmarks show…

His superaccumulator is 536 bytes for double precision (67 64-bit chunks). L1 cache sizes are typically ~32 kB.

1 Like

The paper discusses two different superaccumulators, where the “fast” one is 4096 × 64-bit chunks (32 kB).

1 Like

It also argues that most of these entries are not used for typical data with limited range. In any case, it claims at most a factor of two benefit from the “fast” method vs. the “small” method; I suspect that most cases requiring an exactly rounded sum are not performance-critical enough for this to matter. It would be nice to have an implementation of at least the “small” method in a Julia package for people who need exactly rounded floating-point sums.

1 Like

This is exactly what I was referring to. It’d be easy to write an implementation of the “fast” method with user-selectable range and memory footprint.

2 Likes

I put together a Julia wrapper for his C code in the Xsum.jl package.

I was too lazy to re-implement it from scratch, but I think this should be sufficient for most needs: it provides an xsum function that does exact double-precision (Float64 and ComplexF64) accumulation over arbitrary iterators (with an optimized method for contiguous arrays).

It seems to be quite fast — for an array of 10⁴ elements, it is actually about the same speed as a naive sum without SIMD. With @simd (which is enabled for the built-in sum) it is about 10x slower, since Neal’s code does not seem to exploit SIMD.

julia> function naivesum(a)
           s = zero(eltype(a))
           for x in a
               s += x
           end
           return s
       end
naivesum (generic function with 1 method)

julia> a = randn(10^4);

julia> using Xsum, BenchmarkTools

julia> @btime sum($a);
  1.107 μs (0 allocations: 0 bytes)

julia> @btime xsum($a);
  12.272 μs (0 allocations: 0 bytes)

julia> @btime naivesum($a);
  12.175 μs (0 allocations: 0 bytes)

julia> err(s, exact) = abs(s - exact) / abs(exact) # relative error
err (generic function with 1 method)

julia> exact = Float64(sum(big.(a))); err(sum(a), exact), err(xsum(a), exact), err(naivesum(a), exact)
(1.5635496798094023e-16, 0.0, 6.097843751256669e-15)

On my machine, Neal’s “large” accumulator starts being faster than the “small” accumulator around n=256.

9 Likes

Nice work! That’s impressive speed for such a complex summation algorithm. I have to wonder if SIMD could make it even more competitive.

The algorithm has a lot of data-dependent memory addressing, which would result in gather / scatter operations that are detrimental to SIMD performance. Also, consecutive inputs may or may not affect the same part of the accumulator, making it tricky to handle them in parallel.

Maybe one could develop a fast SIMD implementation that only works on a very limited range of exponents (a user-selected offset ± N) and send any numbers outside of this range to the slower algorithm. The entire accumulation buffer should then fit in a SIMD vector.

What about a SPMD-style implementation?

That is, if you have a vector x, and the computer’s SIMD width is W, you perform W xsum calculations. The first sum is of elements x[1], x[1+W], x[1+2W],…, while the second sum is of x[2], x[2+W], x[2+2W], etc.
Each of these W sums is calculated serially. But we’re calculating W in parallel.
No need for gather/scatter.

The hard part would be keeping it correctly rounded while adding the remainder (length(x) % W) and reducing those W sums, but I’d guess that’s possible – although I am unfamiliar with the xsum algorithm, so I may be wrong.
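In code, the striding scheme would look something like this (a sketch, with an ordinary serial sum standing in for the per-lane xsum state):

# Lane k consumes x[k], x[k+W], x[k+2W], ...; in the real thing each lane
# would carry an xsum accumulator, and the W partial results would still
# need a correctly rounded final reduction.
function strided_partial_sums(x::AbstractVector, W::Integer)
    partial = zeros(eltype(x), W)
    for i in eachindex(x)
        partial[mod1(i, W)] += x[i]  # stand-in for a per-lane xsum update
    end
    return partial
end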

It would still be gather/scatter because of the indexing into the W accumulators.

For fun, I tested with more difficult data. (randn gives a very small range of exponents.)

julia> a = 5000 |> N->randn(N) .* exp.(50 .* randn(N)) |> x->[x;-x] |> x->x[sortperm(rand(length(x)))];

julia> @btime xsum($a)
  22.596 μs (0 allocations: 0 bytes)
0.0

This takes about twice as long as with the simple randn data, presumably due to more cache misses.
The vector sums to zero by construction (which xsum gets right, of course).

For comparison, sum(big.(a)) is more than an order of magnitude slower, and does not give the exact answer using the default 256-bit precision:

julia> @btime sum(big.($a))
  776.165 μs (20004 allocations: 1.14 MiB)
0.1244434958829871841314559402473363089103990532787747361444057905702476659871387
1 Like

I’m not sure it is ever useful to convert numbers that were randomly generated for another precision to BigFloat.

2 Likes

On my machine, xsum is faster than sum_kbn (KahanSummation.jl) at ~800 values. For those who may not know, sum_kbn is quite a bit better at accurate summation than naive summation, but, unlike xsum, it is not guaranteed to give the correctly rounded sum.

Ah, I take it you mean the algorithm uses different accumulators based on the magnitude?
In that case, yes.

I tried to vectorize Kahan summation, but I think I need to make the reduction more accurate: I’m losing a lot of accuracy compared to sum_kbn from KahanSummation.jl.

julia> using VectorizationBase, SIMDPirates, KahanSummation, BenchmarkTools

julia> function sum_kahan(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
           px = pointer(x)
           W, shift = VectorizationBase.pick_vector_width_shift(T)
           sizeT = sizeof(T)
           WT = W * sizeT
           WT4 = 4WT
           N = length(x)
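           # four independent (sum, compensation) accumulator pairs: s_1..s_4, c_1..c_4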
           Base.Cartesian.@nexprs 4 u -> begin
               s_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
               c_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
           end
           Nshift = N >> (shift + 2)
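           # main loop, unrolled 4x: each iteration consumes 4 SIMD vectors of width W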
           for n in 0:Nshift-1
               Base.Cartesian.@nexprs 4 u -> begin
                   x_u = SIMDPirates.vload(Vec{W,T}, px + n*WT4 + (u-1)*WT)
                   t_u = SIMDPirates.evadd(s_u, x_u)
                   m_u = SIMDPirates.vless(SIMDPirates.vabs(x_u), SIMDPirates.vabs(s_u))
                   subfrom_u = SIMDPirates.vifelse(m_u, s_u, x_u)
                   addto_u = SIMDPirates.vifelse(m_u, x_u, s_u)
                   c_u = SIMDPirates.evsub(c_u, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_u, t_u), addto_u))
                   s_u = t_u
               end
           end
           
           rem = N & (4W-1)
           offset = Nshift * WT4
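           # remainder: whole SIMD vectors of width W, accumulated into pair 1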
           for n in 0:(rem >> shift) - 1
               x_1 = SIMDPirates.vload(Vec{W,T}, px + offset)
               t_1 = SIMDPirates.evadd(s_1, x_1)
               m_1 = SIMDPirates.vless(SIMDPirates.vabs(x_1), SIMDPirates.vabs(s_1))
               subfrom_1 = SIMDPirates.vifelse(m_1, s_1, x_1)
               addto_1 = SIMDPirates.vifelse(m_1, x_1, s_1)
               c_1 = SIMDPirates.evsub(c_1, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_1, t_1), addto_1))
               s_1 = t_1
               offset += WT
           end

           rem &= (W-1)
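           # masked load for the final rem < W elements, accumulated into pair 2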
           if rem > 0
               mask = unsafe_trunc(VectorizationBase.mask_type(T), 2^rem - 1)
               
               x_2 = SIMDPirates.vload(Vec{W,T}, px + offset, mask)
               t_2 = SIMDPirates.evadd(s_2, x_2)
               m_2 = SIMDPirates.vless(SIMDPirates.vabs(x_2), SIMDPirates.vabs(s_2))
               subfrom_2 = SIMDPirates.vifelse(m_2, s_2, x_2)
               addto_2 = SIMDPirates.vifelse(m_2, x_2, s_2)
               c_2 = SIMDPirates.evsub(c_2, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_2, t_2), addto_2))
               s_2 = t_2
           end

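           # reduce the four (s, c) accumulator pairs, then horizontally sum s - c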
           c_1 = SIMDPirates.vadd(c_1, c_2)
           c_3 = SIMDPirates.vadd(c_3, c_4)
           c_1 = SIMDPirates.vadd(c_1, c_3)

           s_1 = SIMDPirates.vadd(s_1, s_2)
           s_3 = SIMDPirates.vadd(s_3, s_4)
           s_1 = SIMDPirates.vadd(s_1, s_3)
           SIMDPirates.vsum(SIMDPirates.vsub(s_1,c_1))
       end
sum_kahan (generic function with 1 method)

julia> x = 10^14 .* sin.((0.001:0.001:1) .* 2π);

julia> @btime Float64(sum(big, $x))
  91.590 μs (3998 allocations: 218.64 KiB)
1.1248384929460264

julia> @btime sum($x)
  43.253 ns (0 allocations: 0 bytes)
0.49223069682955295

julia> @btime sum_kbn($x)
  1.241 μs (0 allocations: 0 bytes)
1.1248384929460276

julia> @btime sum_kahan($x)
  151.435 ns (0 allocations: 0 bytes)
1.124755859375

It’s a lot more accurate than pairwise summation, and much faster than sum_kbn, but given that sum_kbn’s answer is correct and sum_kahan’s is fairly off, I’d probably pick the former (if I’m expecting to sum vectors like x).

julia> relative_error(x, y) = (x - y) / y

julia> relative_error(sum_kbn(x), Float64(sum(big, x)))
9.87006607248485e-16

julia> relative_error(sum_kahan(x), Float64(sum(big, x)))
-7.346260956097278e-5

julia> relative_error(sum(x), Float64(sum(big, x)))
-0.562398779988078

Any suggestions on how to improve the accuracy of sum_kahan?

Kahan summation fails for Per’s vector, while pairwise gets it right:

julia> a = 5000 |> N->randn(N) .* exp.(50 .* randn(N)) |> x->[x;-x] |> x->x[sortperm(rand(length(x)))];

julia> @btime Float64(sum(big, $a))
  1.052 ms (39998 allocations: 2.14 MiB)
1.40737488355328e14

julia> @btime sum($a)
  985.308 ns (0 allocations: 0 bytes)
0.0

julia> @btime sum_kbn($a)
  1.241 μs (0 allocations: 0 bytes)
1.1912085629266485e-5

julia> @btime sum_kahan($a)
  147.680 ns (0 allocations: 0 bytes)
1.1913478374481201e-5

julia> setprecision(1024); # 256 isn't good enough

julia> @btime sum(big, $a)
  1.172 ms (39998 allocations: 3.97 MiB)
0.0
2 Likes

Is this reproducible on your machine, for various arrays a?

I would tend to think you got (un)lucky on this one:

julia> a = 5000 |> N->randn(N) .* exp.(50 .* randn(N)) |> x->[x;-x] |> x->x[sortperm(rand(length(x)))];

julia> sum(a)
-3.064991081731778e54

julia> sum(big, a)
-1.430511474609375e-06

julia> sum(Rational{BigInt}, a)
0//1

julia> a = 5000 |> N->randn(N) .* exp.(50 .* randn(N)) |> x->[x;-x] |> x->x[sortperm(rand(length(x)))];

julia> sum(a)
2.0892633586169203e54

julia> sum(big, a)
5.9304284150130115449428558349609375e-07

julia> sum(Rational{BigInt}, a)
0//1
1 Like

Yes. Changing the 50 to 5 inside exp will significantly reduce the likelihood of getting the right sum by accident.

1 Like

I don’t think your problem comes from the final reduction, but rather from the vectorized summation itself.

After running many tests (most of which made no sense whatsoever), I have come to the strong suspicion that LLVM is breaking some of the compensation during its optimization phase.
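A quick way to see this failure mode is with Base’s @fastmath (a different mechanism, but one that grants the compiler similar reassociation licenses; everything below is illustrative):

# twoSum's error term is algebraically zero; a reassociating compiler is
# therefore allowed to fold the whole compensation away.
two_sum_err(a, b) = (x = a + b; z = x - a; (a - (x - z)) + (b - z))
@fastmath two_sum_err_fm(a, b) = (x = a + b; z = x - a; (a - (x - z)) + (b - z))

two_sum_err(1e16, 1.0)     # 1.0: the exact rounding error of 1e16 + 1.0
two_sum_err_fm(1e16, 1.0)  # may return 0.0 once the compensation is optimized away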

The simplest example I could come up with to demonstrate the issue uses the Priest summation algorithm (which I find a bit easier to reason about than Kahan’s, though that might just be because I’m much more used to it).

EDIT: this algorithm seems to have been introduced by Ogita, Rump and Oishi in 2005. I incorrectly attributed it to Priest, whose summation algorithm is doubly compensated.

First, the scalar version:

function sum_priest_scal(x)
    s = 0.0
    e = 0.0
    for xi in x
        s, ei = twoSum(s, xi)
        e += ei
    end

    s + e
end

function twoSum(a::Float64, b::Float64)
    x = a+b
    z = x - a
    e = (a - (x-z)) + (b-z)
    (x, e)
end

We can check that it works quite well on your input data:

relative_error(x, y) = (x - y) / y

function check_fun(f, vec_size=1000)
    x = 10^14 .* sin.((0.001:0.001:1) .* 2π)
    x = x[1:vec_size]
    v = f(x)
    v, relative_error(v, Float64(sum(big, x)))
end

julia> check_fun(sum_priest_scal)
(1.1248384929460276, 9.87006607248485e-16)

Now for a (naively) vectorized version which, for the sake of simplicity, only works for vectors whose size is a multiple of the vector pack size:

using VectorizationBase, SIMDPirates

function sum_priest_vec(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
    W = VectorizationBase.pick_vector_width(T)

    s = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
    e = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))

    offset = 1
    while offset+W <= length(x)+1
        xi = SIMDPirates.vload(Vec{W,T}, x, offset)
        s,ei = twoSum(s, xi)
        # Comment or Uncomment the following line to exhibit different behavior
        # @show ei
        e = SIMDPirates.vadd(e, ei)

        offset += W
    end

    # If the array size is not a multiple of W, we should handle the remainder here
    @assert offset == length(x)+1

    s = SIMDPirates.vadd(s, e)
    SIMDPirates.vsum(s)
end

function twoSum(a, b)
    x = SIMDPirates.vadd(a, b)
    z = SIMDPirates.vsub(x, a)
    e = SIMDPirates.vadd(
        SIMDPirates.vsub(a, SIMDPirates.vsub(x, z)),
        SIMDPirates.vsub(b, z)
    )
    (x, e)
end

This version can only be tested on a part of your input vector, which then becomes much better conditioned. We still observe some loss of accuracy w.r.t. the scalar version:

julia> check_fun(sum_priest_scal, 996)
(3.7697623564331396e12, 0.0)

julia> check_fun(sum_priest_vec,  996)
(3.769762356433375e12, 6.243140554957531e-14)

Some of this accuracy can be recovered by uncommenting the @show line above in sum_priest_vec:

julia> check_fun(sum_priest_vec,  996)
# [... lots of output ...]
(3.76976235643314e12, 1.2952573765472056e-16)

I suspect that this is due to the additional optimizations that SIMDPirates.jl allows in comparison to SIMD.jl. For example, the same algorithm implemented with SIMD.jl behaves no differently from the scalar version.

# (in an other session, since SIMD and SIMDPirates conflict)
using VectorizationBase, SIMD
function sum_priest_vec2(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
    W = VectorizationBase.pick_vector_width(T)

    s = Vec{W,T}(0)
    e = Vec{W,T}(0)

    offset = 1
    while offset+W <= length(x)+1
        xi = SIMD.vload(Vec{W,T}, x, offset)
        s,ei = twoSum(s, xi)
        e += ei

        offset += W
    end

    # If the array size is not a multiple of W, we should handle the remainder here
    @assert offset == length(x)+1

    sum(s+e)
end

function twoSum(a::Vec, b::Vec)
    x = a+b
    z = x - a
    e = (a - (x-z)) + (b-z)
    (x, e)
end

julia> check_fun(sum_priest_vec2, 996)
(3.7697623564331396e12, 0.0)

I have not yet implemented fully vectorized versions (accounting for the remainder) to check what happens with the full 1000-element vector and to look at performance.

2 Likes

Here is a complete vectorized (but maybe naive) implementation, correctly handling remainders. I took the opportunity to correct the attributions in my previous post: the algorithm I was using was introduced by Ogita, Rump & Oishi in 2005. It is equivalent to Kahan-Babuška, except that the error-free transformation at its core is the “twoSum” algorithm by Knuth. In contrast, KB uses the “fastTwoSum” algorithm by Dekker, which needs a test to sort its arguments and may therefore be less suitable for vectorization (although @Elrod seems to know quite a few tricks for implementing this kind of thing efficiently).
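For reference, here is what Dekker’s fastTwoSum looks like (a sketch; it is exact only when |a| ≥ |b|, hence the magnitude test):

# Dekker's fastTwoSum: exact only if |a| >= |b|, so the arguments must be
# sorted by magnitude first; that branch is what makes it awkward to vectorize.
function fast_two_sum(a::Float64, b::Float64)
    if abs(a) < abs(b)
        a, b = b, a
    end
    x = a + b
    e = b - (x - a)  # exact error term, given the ordering
    (x, e)
end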

In any case, here is the full code:

1. Simple, scalar implementation of “Sum2” by Ogita, Rump and Oishi

import VectorizationBase
using SIMD

# D. E. Knuth, The Art of Computer Programming: Seminumerical Algorithms, 1969.
function two_sum(a::T, b::T) where T <: Union{Real, Vec}
    x = a + b
    z = x - a
    e = (a - (x-z)) + (b-z)
    (x, e)
end


# T. Ogita, S. Rump and S. Oishi, "Accurate sum and dot product",
# SIAM Journal on Scientific Computing, 6(26), 2005.
# DOI: 10.1137/030601818
function sum_oro_scal(x)
    s = 0.0
    e = 0.0
    for xi in x
        s, ei = two_sum(s, xi)
        e += ei
    end

    s + e
end

2. Vectorized implementation of the same algorithm using SIMD.jl and a vectorization technique similar to that of @Elrod above (thanks, I learned quite a lot!)

# T. Ogita, S. Rump and S. Oishi, "Accurate sum and dot product",
# SIAM Journal on Scientific Computing, 6(26), 2005.
# DOI: 10.1137/030601818
function sum_oro_vec(x::AbstractArray{T}) where T <: Union{Float32, Float64}
    W = VectorizationBase.pick_vector_width(T)
    N = length(x)

    Base.Cartesian.@nexprs 4 u -> begin
        s_u = Vec{W,T}(0)
        e_u = Vec{W,T}(0)
    end

    offset = 1

    # Partially unrolled loop by batches of 4 packs of width W
    while offset+4W-1 <= N
        Base.Cartesian.@nexprs 4 u -> begin
            xi = SIMD.vload(Vec{W,T}, x, offset)
            s_u, ei = two_sum(s_u, xi)
            e_u += ei
            offset += W
        end
    end

    # Remainder by packs of width W
    while offset+W-1 <= N
        xi = SIMD.vload(Vec{W,T}, x, offset)
        s_1, ei = two_sum(s_1, xi)
        e_1 += ei
        offset += W
    end

    # Reduction: (I'm guessing what to do here)
    # - always compensate summation of high-order terms
    # - error terms are added naively

    # Stage 1:   s_1 += s_2 + s_3 + s_4
    s_1, ei = two_sum(s_1, s_3)
    e_1 += ei

    s_2, ei = two_sum(s_2, s_4)
    e_2 += ei

    s_1, ei = two_sum(s_1, s_2)
    e_3 += ei

    # Stage 2:   s = sum(s_1)
    s = zero(T)
    e = zero(T)
    for i in 1:W
        @inbounds xi = s_1[i]
        s, ei = two_sum(s, xi)
        e += ei
    end
    e += sum((e_1+e_2) + (e_3+e_4))

    # Remainder, unvectorized
    while offset <= N
        @inbounds xi = x[offset]
        s, ei = two_sum(s, xi)
        e += ei
        offset += 1
    end

    s+e
end

3. Accuracy tests

relative_error(x, y) = (x - y) / y

function check_fun(x, f)
    v = f(x)
    v, relative_error(v, Float64(sum(Rational{BigInt}, x)))
end

using the input data proposed by @Elrod:

julia> x = 10^14 .* sin.((0.001:0.001:1) .* 2π);

julia> check_fun(x, sum)
(-0.41792555317044705, -1.3715427199471744)

julia> check_fun(x, sum_oro_scal)
(1.1248384929460276, 9.87006607248485e-16)

julia> check_fun(x, sum_oro_vec)
(1.1248384929460264, 0.0)

julia> using KahanSummation
julia> check_fun(x, sum_kbn)
(1.1248384929460276, 9.87006607248485e-16)

or the input data proposed by @Per (slightly modified so that it sums to 1 and we can compute relative errors):

julia> x = 5000 |> N->randn(N) .* exp.(10 .* randn(N)) |> x->[x;-x;1.0] |> x->x[sortperm(rand(length(x)))];

julia> check_fun(x, sum)
(0.734375, -0.265625)

julia> check_fun(x, sum_oro_scal)
(0.9999999999999993, -6.661338147750939e-16)

julia> check_fun(x, sum_oro_vec)
(1.0000000000000002, 2.220446049250313e-16)

julia> check_fun(x, sum_kbn)
(0.9999999999999993, -6.661338147750939e-16)

4. Performance tests

julia> using BenchmarkTools

julia> @btime sum($x)
  1.314 μs (0 allocations: 0 bytes)
0.734375

julia> @btime sum_oro_scal($x)
  19.686 μs (0 allocations: 0 bytes)
0.9999999999999993

julia> @btime sum_oro_vec($x)
  3.392 μs (0 allocations: 0 bytes)
1.0000000000000002

julia> @btime sum_kbn($x)
  20.389 μs (0 allocations: 0 bytes)
0.9999999999999993

So the vectorized version incurs a slowdown by a factor of 2 to 3 w.r.t. the standard pairwise summation. I would be curious to compare this to a vectorized KBN implementation (but I fear that efficiently implementing something that features a test is above my capabilities).



6 Likes