Radford’s C code has now been posted under an MIT license: gitlab.com/radfordneal/xsum
Radford’s C code is designed for summing C doubles. We could create an interface to the C code, and that would work for summing many Float64s accurately. Or we could reimplement it in Julia and allow summing any T<:AbstractFloat. The maximum number of floats summable by the small superaccumulator will vary with the precision of the values being summed.
Also, in a Julia implementation the size of the accumulator could be a type parameter, making it possible to use less memory when the magnitude of the sum is known to be within a given range.
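To make the idea concrete, here is a minimal sketch of what such a Julia implementation could look like. This is not Neal’s actual chunk layout or carry-propagation scheme, and the names (Superacc, accum!, total) are invented for illustration: each Float64 is split exactly into an integer mantissa and a power-of-two exponent, and the shifted mantissa is added into a wide fixed-point number stored as Int64 limbs; the correctly rounded readout is done lazily through a sufficiently precise BigFloat.

struct Superacc
    chunks::Vector{Int64}  # limb i is scaled by 2^(32*(i-1) - 1088)
end
Superacc() = Superacc(zeros(Int64, 67))  # 67 limbs cover the Float64 range here

function accum!(acc::Superacc, x::Float64)
    isfinite(x) || throw(DomainError(x, "only finite values"))
    bits = reinterpret(UInt64, x)
    frac = bits & 0x000f_ffff_ffff_ffff
    expo = Int((bits >> 52) & 0x7ff)
    # exact decomposition x == m * 2^e with integer m, |m| < 2^53
    m, e = expo == 0 ? (Int64(frac), -1074) :                    # subnormal
                       (Int64(frac | (UInt64(1) << 52)), expo - 1075)
    signbit(x) && (m = -m)
    E = e + 1088                     # bias so E >= 0 (E in [14, 2059])
    i, sh = (E >> 5) + 1, E & 31     # 1-based limb index, bit offset
    v = Int128(m) << sh              # at most 84 significant bits
    acc.chunks[i]   += Int64(v & 0xffffffff)           # low 32 bits
    acc.chunks[i+1] += Int64((v >> 32) & 0xffffffff)   # middle 32 bits
    acc.chunks[i+2] += Int64(v >> 64)                  # signed high part
    # each call adds < 2^33 to any limb, so ~2^30 inputs fit before a real
    # implementation (like Neal's) would have to propagate carries
    return acc
end

# correctly rounded readout; exact because 4096 bits span the whole range
function total(acc::Superacc)
    setprecision(BigFloat, 4096) do
        s = zero(BigFloat)
        for (i, c) in enumerate(acc.chunks)
            c == 0 || (s += BigFloat(c) * ldexp(one(BigFloat), 32*(i-1) - 1088))
        end
        Float64(s)
    end
end

With this sketch, accum!-ing 1e100, 1.0 and -1e100 and then calling total returns exactly 1.0, where naive left-to-right Float64 summation returns 0.0.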
Having read Radford’s paper, I’m worried that the accumulator will completely fill the L1 cache and therefore be slower in practice than the benchmarks show…
His superaccumulator is 536 bytes for double precision (67 64-bit chunks). L1 cache sizes are typically ~32kB.
The paper discusses two different superaccumulators, where the “fast” one is 4096 × 64 bits (32 kB).
It also argues that most of these entries are not used for typical data with limited range. In any case, it claims at most a factor of two benefit from the “fast” method vs. the “small” method; I suspect that most cases requiring an exactly rounded sum are not performance-critical enough for this to matter. It would be nice to have an implementation of at least the “small” method in a Julia package for people who need exactly rounded floating-point sums.
This is exactly what I was referring to. It’d be easy to write an implementation of the “fast” method with user-selectable range and memory footprint.
I put together a Julia wrapper for his C code: the Xsum.jl package. I was too lazy to re-implement it from scratch, but I think this should be sufficient for most needs: it provides an xsum function that does exact double-precision (Float64 and ComplexF64) accumulation over arbitrary iterators (with an optimized method for contiguous arrays).
It seems to be quite fast — for an array of 10⁴ elements, it is actually about the same speed as a naive sum without SIMD. With @simd (which is enabled for the built-in sum) it is about 10x slower, since Neal’s code does not seem to exploit SIMD.
julia> function naivesum(a)
s = zero(eltype(a))
for x in a
s += x
end
return s
end
naivesum (generic function with 1 method)
julia> a = randn(10^4);
julia> using Xsum, BenchmarkTools
julia> @btime sum($a);
1.107 μs (0 allocations: 0 bytes)
julia> @btime xsum($a);
12.272 μs (0 allocations: 0 bytes)
julia> @btime naivesum($a);
12.175 μs (0 allocations: 0 bytes)
julia> err(s, exact) = abs(s - exact) / abs(exact) # relative error
err (generic function with 1 method)
julia> exact = Float64(sum(big.(a))); err(sum(a), exact), err(xsum(a), exact), err(naivesum(a), exact)
(1.5635496798094023e-16, 0.0, 6.097843751256669e-15)
On my machine, Neal’s “large” accumulator starts being faster than the “small” accumulator around n=256.
Nice work! That’s impressive speed for such a complex summation algorithm. I have to wonder if SIMD could make it even more competitive.
The algorithm has a lot of data-dependent memory addressing, which would result in gather / scatter operations that are detrimental to SIMD performance. Also, consecutive inputs may or may not affect the same part of the accumulator, making it tricky to handle them in parallel.
Maybe one could develop a fast SIMD implementation that only works on a very limited range of exponents (a user-selected offset ± N) and send any numbers outside of this range to the slower algorithm. The entire accumulation buffer should then fit in a SIMD vector.
What about a SPMD-style implementation? That is, if you have a vector x, and the computer’s SIMD width is W, you perform W xsum calculations. The first sum is of elements x[1], x[1+W], x[1+2W], …, while the second sum is of x[2], x[2+W], x[2+2W], etc.
Each of these W sums is calculated serially, but we’re calculating W of them in parallel. No need for gather/scatter.
The hard part would be keeping it correctly rounded while adding the remainder (length(x) % W elements) and reducing those W sums, but I’d guess that’s possible – although I am unfamiliar with the xsum algorithm, so I may be wrong.
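To illustrate, here is a sketch of just the data layout (plain Float64 accumulators stand in for the W superaccumulators a real version would keep; spmd_partial_sums is a made-up name):

# Lane l serially sums x[l], x[l+W], x[l+2W], …; the lanes advance in
# lockstep, so each outer iteration reads W contiguous elements and no
# gather/scatter of the inputs is needed.
function spmd_partial_sums(x::Vector{Float64}, W::Int)
    s = zeros(Float64, W)            # one accumulator per lane
    n = length(x) - length(x) % W    # full W-wide portion
    for i in 1:W:n
        for l in 1:W
            s[l] += x[i + l - 1]
        end
    end
    # the remainder (length(x) % W elements) and the reduction of the W
    # partial sums are the parts that must stay correctly rounded
    return s
end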
It would still be gather/scatter because of the indexing into the W accumulators.
For fun, I tested with more difficult data, since randn gives a very small range of exponents. The construction below scales normal variates by enormous random powers, appends the negation of every element (so the exact sum is zero), and shuffles:
julia> a = 5000 |> N->randn(N) .* exp.(50 .* randn(N)) |> x->[x;-x] |> x->x[sortperm(rand(length(x)))];
julia> @btime xsum($a)
22.596 μs (0 allocations: 0 bytes)
0.0
This takes about twice as long as with the simple randn data, presumably due to more cache misses. The vector sums to zero by construction (which xsum gets right, of course).
For comparison, sum(big.(a)) is more than an order of magnitude slower, and does not give the exact answer at the default 256-bit precision:
julia> @btime sum(big.($a))
776.165 μs (20004 allocations: 1.14 MiB)
0.1244434958829871841314559402473363089103990532787747361444057905702476659871387
I’m not sure it’s ever useful to convert numbers randomly generated at one precision to BigFloat.
On my machine, xsum is faster than sum_kbn (from KahanSummation.jl) above ~800 values. For those who may not know, sum_kbn is quite a bit better than naive summation at accuracy, but, unlike xsum, is not guaranteed to give the correctly rounded sum.
Ah, I take it you mean the algorithm uses different accumulators based on the magnitude?
In that case, yes.
I tried to vectorize Kahan summation, but I think I need to make the reduction more accurate: I’m losing a lot of accuracy compared to sum_kbn from KahanSummation.jl.
julia> using VectorizationBase, SIMDPirates, KahanSummation, BenchmarkTools
julia> function sum_kahan(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
px = pointer(x)
W, shift = VectorizationBase.pick_vector_width_shift(T)
sizeT = sizeof(T)
WT = W * sizeT
WT4 = 4WT
N = length(x)
Base.Cartesian.@nexprs 4 u -> begin
s_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
c_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
end
Nshift = N >> (shift + 2)
for n in 0:Nshift-1
Base.Cartesian.@nexprs 4 u -> begin
x_u = SIMDPirates.vload(Vec{W,T}, px + n*WT4 + (u-1)*WT)
t_u = SIMDPirates.evadd(s_u, x_u)
m_u = SIMDPirates.vless(SIMDPirates.vabs(x_u), SIMDPirates.vabs(s_u))
subfrom_u = SIMDPirates.vifelse(m_u, s_u, x_u)
addto_u = SIMDPirates.vifelse(m_u, x_u, s_u)
c_u = SIMDPirates.evsub(c_u, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_u, t_u), addto_u))
s_u = t_u
end
end
rem = N & (4W-1)
offset = Nshift * WT4
for n in 0:(rem >> shift) - 1
x_1 = SIMDPirates.vload(Vec{W,T}, px + offset)
t_1 = SIMDPirates.evadd(s_1, x_1)
m_1 = SIMDPirates.vless(SIMDPirates.vabs(x_1), SIMDPirates.vabs(s_1))
subfrom_1 = SIMDPirates.vifelse(m_1, s_1, x_1)
addto_1 = SIMDPirates.vifelse(m_1, x_1, s_1)
c_1 = SIMDPirates.evsub(c_1, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_1, t_1), addto_1))
s_1 = t_1
offset += WT
end
rem &= (W-1)
if rem > 0
mask = unsafe_trunc(VectorizationBase.mask_type(T), 2^rem - 1)
x_2 = SIMDPirates.vload(Vec{W,T}, px + offset, mask)
t_2 = SIMDPirates.evadd(s_2, x_2)
m_2 = SIMDPirates.vless(SIMDPirates.vabs(x_2), SIMDPirates.vabs(s_2))
subfrom_2 = SIMDPirates.vifelse(m_2, s_2, x_2)
addto_2 = SIMDPirates.vifelse(m_2, x_2, s_2)
c_2 = SIMDPirates.evsub(c_2, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_2, t_2), addto_2))
s_2 = t_2
end
c_1 = SIMDPirates.vadd(c_1, c_2)
c_3 = SIMDPirates.vadd(c_3, c_4)
c_1 = SIMDPirates.vadd(c_1, c_3)
s_1 = SIMDPirates.vadd(s_1, s_2)
s_3 = SIMDPirates.vadd(s_3, s_4)
s_1 = SIMDPirates.vadd(s_1, s_3)
SIMDPirates.vsum(SIMDPirates.vsub(s_1,c_1))
end
sum_kahan (generic function with 1 method)
julia> x = 10^14 .* sin.((0.001:0.001:1) .* 2π);
julia> @btime Float64(sum(big, $x))
91.590 μs (3998 allocations: 218.64 KiB)
1.1248384929460264
julia> @btime sum($x)
43.253 ns (0 allocations: 0 bytes)
0.49223069682955295
julia> @btime sum_kbn($x)
1.241 μs (0 allocations: 0 bytes)
1.1248384929460276
julia> @btime sum_kahan($x)
151.435 ns (0 allocations: 0 bytes)
1.124755859375
It’s a lot more accurate than pairwise summation, and much faster than sum_kbn, but given that sum_kbn’s answer is correct and sum_kahan’s is fairly off, I’d probably pick the former (if I’m expecting to sum vectors like x).
julia> relative_error(x, y) = (x - y) / y
julia> relative_error(sum_kbn(x), Float64(sum(big, x)))
9.87006607248485e-16
julia> relative_error(sum_kahan(x), Float64(sum(big, x)))
-7.346260956097278e-5
julia> relative_error(sum(x), Float64(sum(big, x)))
-0.562398779988078
Any suggestions on how to improve the accuracy of sum_kahan?
Kahan summation fails for Per’s vector, while pairwise gets it right:
julia> a = 5000 |> N->randn(N) .* exp.(50 .* randn(N)) |> x->[x;-x] |> x->x[sortperm(rand(length(x)))];
julia> @btime Float64(sum(big, $a))
1.052 ms (39998 allocations: 2.14 MiB)
1.40737488355328e14
julia> @btime sum($a)
985.308 ns (0 allocations: 0 bytes)
0.0
julia> @btime sum_kbn($x)
1.241 μs (0 allocations: 0 bytes)
1.1912085629266485e-5
julia> @btime sum_kahan($x)
147.680 ns (0 allocations: 0 bytes)
1.1913478374481201e-5
julia> setprecision(1024); # 256 isn't good enough
julia> @btime sum(big, $a)
1.172 ms (39998 allocations: 3.97 MiB)
0.0
Is this reproducible on your machine, for various arrays a? I would tend to think you got (un)lucky on this one:
julia> a = 5000 |> N->randn(N) .* exp.(50 .* randn(N)) |> x->[x;-x] |> x->x[sortperm(rand(length(x)))];
julia> sum(a)
-3.064991081731778e54
julia> sum(big, a)
-1.430511474609375e-06
julia> sum(Rational{BigInt}, a)
0//1
julia> a = 5000 |> N->randn(N) .* exp.(50 .* randn(N)) |> x->[x;-x] |> x->x[sortperm(rand(length(x)))];
julia> sum(a)
2.0892633586169203e54
julia> sum(big, a)
5.9304284150130115449428558349609375e-07
julia> sum(Rational{BigInt}, a)
0//1
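(Rational{BigInt} arithmetic is exact here: every finite Float64 is a dyadic rational, so the 0//1 results confirm the true sum really is zero by construction, whatever the rounded BigFloat sums say.)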
Yes. Changing the 50 to 5 inside of exp will significantly reduce the likelihood of getting the right sum by accident.
I don’t think your problem comes from the final reduction, but rather from the vectorized summation itself.
After many tests (most of them making no sense whatsoever), I have come to the strong feeling that LLVM is breaking some of the compensation during its optimization passes.
The smallest example I could come up with to demonstrate the issue uses the Priest summation algorithm (which I find a bit easier to reason about than Kahan’s, though that may just be familiarity).
EDIT: this algorithm seems to have been introduced by Ogita, Rump and Oishi in 2005. I incorrectly attributed it to Priest, whose summation algorithm is doubly compensated.
First, the scalar version:
function sum_priest_scal(x)
s = 0.
e = 0.
for xi in x
s, ei = twoSum(s, xi)
e += ei
end
s + e
end
function twoSum(a::Float64, b::Float64)
x = a+b
z = x - a
e = (a - (x-z)) + (b-z)
(x, e)
end
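As a quick sanity check of the error-free transform (the sum a + b equals x + e exactly):

julia> twoSum(1e16, 1.0)  # 1e16 + 1 rounds back to 1e16; e recovers the lost 1.0
(1.0e16, 1.0)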
We can check that it does work quite well on your input data:
relative_error(x, y) = (x - y) / y
function check_fun(f, vec_size=1000)
x = 10^14 .* sin.((0.001:0.001:1) .* 2π)
x = x[1:vec_size]
v = f(x)
v, relative_error(v, Float64(sum(big, x)))
end
julia> check_fun(sum_priest_scal)
(1.1248384929460276, 9.87006607248485e-16)
Now a (naively) vectorized version which, for the sake of simplicity, only works for vectors whose length is a multiple of the vector pack width:
using VectorizationBase, SIMDPirates
function sum_priest_vec(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
W = VectorizationBase.pick_vector_width(T)
s = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
e = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
offset = 1
while offset+W <= length(x)+1
xi = SIMDPirates.vload(Vec{W,T}, x, offset)
s,ei = twoSum(s, xi)
# Comment or Uncomment the following line to exhibit different behavior
# @show ei
e = SIMDPirates.vadd(e, ei)
offset += W
end
# If the array size is not a multiple of W, we should handle the remainder here
@assert offset == length(x)+1
s = SIMDPirates.vadd(s, e)
SIMDPirates.vsum(s)
end
function twoSum(a, b)
x = SIMDPirates.vadd(a, b)
z = SIMDPirates.vsub(x, a)
e = SIMDPirates.vadd(
SIMDPirates.vsub(a, SIMDPirates.vsub(x, z)),
SIMDPirates.vsub(b, z)
)
(x, e)
end
This version can only be tested on part of your input vector, which is then much better conditioned. We still observe some loss of accuracy w.r.t. the scalar version:
julia> check_fun(sum_priest_scal, 996)
(3.7697623564331396e12, 0.0)
julia> check_fun(sum_priest_vec, 996)
(3.769762356433375e12, 6.243140554957531e-14)
Some of this accuracy can be retrieved by uncommenting the @show line above in sum_priest_vec:
julia> check_fun(sum_priest_vec, 996)
# [... lots of output ...]
(3.76976235643314e12, 1.2952573765472056e-16)
I suspect that this is due to the additional optimizations that SIMDPirates.jl allows in comparison to SIMD.jl. For example, the same algorithm implemented with SIMD.jl behaves no differently from the scalar version.
# (in an other session, since SIMD and SIMDPirates conflict)
using VectorizationBase, SIMD
function sum_priest_vec2(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
W = VectorizationBase.pick_vector_width(T)
s = Vec{W,T}(0)
e = Vec{W,T}(0)
offset = 1
while offset+W <= length(x)+1
xi = SIMD.vload(Vec{W,T}, x, offset)
s,ei = twoSum(s, xi)
e += ei
offset += W
end
# If the array size is not a multiple of W, we should handle the remainder here
@assert offset == length(x)+1
sum(s+e)
end
function twoSum(a::Vec, b::Vec)
x = a+b
z = x - a
e = (a - (x-z)) + (b-z)
(x, e)
end
julia> check_fun(sum_priest_vec2, 996)
(3.7697623564331396e12, 0.0)
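A scalar illustration of the suspected mechanism, assuming the extra freedom SIMDPirates takes is fast-math-style flags (two_sum_fast is a made-up name): with value-unsafe reassociation the compensation term is algebraically zero, so the compiler may simply fold it away.

# In exact arithmetic, z = x - a and e = (a - (x-z)) + (b-z) simplify to
# e == 0, so fast-math flags license LLVM to delete the compensation.
@fastmath function two_sum_fast(a::Float64, b::Float64)
    x = a + b
    z = x - a
    e = (a - (x - z)) + (b - z)   # may compile to a constant 0.0
    (x, e)
end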
I have not yet implemented fully vectorized versions (accounting for the remainder) in order to check what happens with the full 1000-element vector and to look at the performance.
Here is a complete vectorized (but maybe naive) implementation, correctly handling remainders. I took the opportunity to correct the attributions in my previous post: the algorithm I was using was introduced by Ogita, Rump & Oishi in 2005. It is equivalent to Kahan-Babuska, except that the error-free transform at its core is the “twoSum” algorithm by Knuth. In contrast, KB uses the “fastTwoSum” algorithm by Dekker, which needs a test to sort its arguments and may therefore be less suitable for vectorization (although @Elrod seems to know quite a few tricks to still implement this efficiently; see the sketch below).
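For reference, a scalar sketch of Dekker’s fastTwoSum, with the magnitude test that makes straightforward vectorization awkward:

# T. J. Dekker (1971): exact error of a rounded sum, valid when |a| >= |b|,
# hence the data-dependent swap.
function fast_two_sum(a::Float64, b::Float64)
    abs(a) < abs(b) && ((a, b) = (b, a))  # sort the pair by magnitude
    x = a + b
    e = b - (x - a)                       # exact rounding error of x
    (x, e)
end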
In any case, here is the full code:
1. Simple, scalar implementation of “Sum2” by Ogita, Rump and Oishi
import VectorizationBase
using SIMD
# D. E. Knuth, The Art of Computer Programming: Seminumerical Algorithms, 1969.
function two_sum(a::T, b::T) where T <: Union{Real, Vec}
x = a + b
z = x - a
e = (a - (x-z)) + (b-z)
(x, e)
end
# T. Ogita, S. Rump and S. Oishi, "Accurate sum and dot product",
# SIAM Journal on Scientific Computing, 6(26), 2005.
# DOI: 10.1137/030601818
function sum_oro_scal(x)
s = 0.
e = 0.
for xi in x
s, ei = two_sum(s, xi)
e += ei
end
s + e
end
2. Vectorized implementation of the same algorithm, using SIMD.jl and a vectorization technique similar to that of @Elrod above (thanks, I learned quite a lot!)
# T. Ogita, S. Rump and S. Oishi, "Accurate sum and dot product",
# SIAM Journal on Scientific Computing, 6(26), 2005.
# DOI: 10.1137/030601818
function sum_oro_vec(x::AbstractArray{T}) where T <: Union{Float32, Float64}
W = VectorizationBase.pick_vector_width(T)
N = length(x)
Base.Cartesian.@nexprs 4 u -> begin
s_u = Vec{W,T}(0)
e_u = Vec{W,T}(0)
end
offset = 1
# Partially unrolled loop by batches of 4 packs of width W
while offset+4W-1 <= N
Base.Cartesian.@nexprs 4 u -> begin
xi = SIMD.vload(Vec{W,T}, x, offset)
s_u, ei = two_sum(s_u, xi)
e_u += ei
offset += W
end
end
# Remainder by packs of width W
while offset+W-1 <= N
xi = SIMD.vload(Vec{W,T}, x, offset)
s_1, ei = two_sum(s_1, xi)
e_1 += ei
offset += W
end
# Reduction: (I'm guessing what to do here)
# - always compensate summation of high-order terms
# - error terms are added naively
# Stage 1: s_1 += s_2 + s_3 + s_4
s_1, ei = two_sum(s_1, s_3)
e_1 += ei
s_2, ei = two_sum(s_2, s_4)
e_2 += ei
s_1, ei = two_sum(s_1, s_2)
e_3 += ei
# Stage 2: s = sum(s_1)
s = zero(T)
e = zero(T)
for i in 1:W
@inbounds xi = s_1[i]
s, ei = two_sum(s, xi)
e += ei
end
e += sum((e_1+e_2) + (e_3+e_4))
# Remainder, unvectorized
while offset <= N
@inbounds xi = x[offset]
s, ei = two_sum(s, xi)
e += ei
offset += 1
end
s+e
end
3. Accuracy tests
relative_error(x, y) = (x - y) / y
function check_fun(x, f)
v = f(x)
v, relative_error(v, Float64(sum(Rational{BigInt}, x)))
end
using the input data proposed by @Elrod:
julia> x = 10^14 .* sin.((0.001:0.001:1) .* 2π);
julia> check_fun(x, sum)
(-0.41792555317044705, -1.3715427199471744)
julia> check_fun(x, sum_oro_scal)
(1.1248384929460276, 9.87006607248485e-16)
julia> check_fun(x, sum_oro_vec)
(1.1248384929460264, 0.0)
julia> using KahanSummation
julia> check_fun(x, sum_kbn)
(1.1248384929460276, 9.87006607248485e-16)
or the input data proposed by @Per (slightly modified so that the elements sum to 1, letting us compute relative errors):
julia> x = 5000 |> N->randn(N) .* exp.(10 .* randn(N)) |> x->[x;-x;1.0] |> x->x[sortperm(rand(length(x)))];
julia> check_fun(x, sum)
(0.734375, -0.265625)
julia> check_fun(x, sum_oro_scal)
(0.9999999999999993, -6.661338147750939e-16)
julia> check_fun(x, sum_oro_vec)
(1.0000000000000002, 2.220446049250313e-16)
julia> check_fun(x, sum_kbn)
(0.9999999999999993, -6.661338147750939e-16)
4. Performance tests
julia> using BenchmarkTools
julia> @btime sum($x)
1.314 μs (0 allocations: 0 bytes)
0.734375
julia> @btime sum_oro_scal($x)
19.686 μs (0 allocations: 0 bytes)
0.9999999999999993
julia> @btime sum_oro_vec($x)
3.392 μs (0 allocations: 0 bytes)
1.0000000000000002
julia> @btime sum_kbn($x)
20.389 μs (0 allocations: 0 bytes)
0.9999999999999993
So the vectorized version incurs a slowdown by a factor of 2 to 3 w.r.t. the standard pairwise summation. I would be curious to compare this to a vectorized KBN implementation (but I fear that efficiently implementing something that features a test is above my capabilities).
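For what it’s worth, the KBN update itself can be written branch-free with ifelse (which lowers to a SIMD select), much as @Elrod’s sum_kahan above does with vifelse. A scalar sketch of one update step (kbn_step is a made-up name):

# One Kahan-Babuska-Neumaier step: both candidate error terms are computed
# and one is selected, so there is no data-dependent control flow.
function kbn_step(s::T, c::T, x::T) where {T<:AbstractFloat}
    t = s + x
    c += ifelse(abs(s) >= abs(x),
                (s - t) + x,   # |s| dominates: low-order bits of x were lost
                (x - t) + s)   # |x| dominates: low-order bits of s were lost
    (t, c)
end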
Full sources for anyone wanting to try this
import VectorizationBase
using SIMD
# D. E. Knuth, The Art of Computer Programming: Seminumerical Algorithms, 1969.
function two_sum(a::T, b::T) where T <: Union{Real, Vec}
x = a + b
z = x - a
e = (a - (x-z)) + (b-z)
(x, e)
end
# T. Ogita, S. Rump and S. Oishi, "Accurate sum and dot product",
# SIAM Journal on Scientific Computing, 6(26), 2005.
# DOI: 10.1137/030601818
function sum_oro_scal(x)
s = 0.
e = 0.
for xi in x
s, ei = two_sum(s, xi)
e += ei
end
s + e
end
# T. Ogita, S. Rump and S. Oishi, "Accurate sum and dot product",
# SIAM Journal on Scientific Computing, 6(26), 2005.
# DOI: 10.1137/030601818
function sum_oro_vec(x::AbstractArray{T}) where T <: Union{Float32, Float64}
W = VectorizationBase.pick_vector_width(T)
N = length(x)
Base.Cartesian.@nexprs 4 u -> begin
s_u = Vec{W,T}(0)
e_u = Vec{W,T}(0)
end
offset = 1
# Partially unrolled loop by batches of 4 packs of width W
while offset+4W-1 <= N
Base.Cartesian.@nexprs 4 u -> begin
xi = SIMD.vload(Vec{W,T}, x, offset)
s_u, ei = two_sum(s_u, xi)
e_u += ei
offset += W
end
end
# Remainder by packs of width W
while offset+W-1 <= N
xi = SIMD.vload(Vec{W,T}, x, offset)
s_1, ei = two_sum(s_1, xi)
e_1 += ei
offset += W
end
# Reduction: (I'm guessing what to do here)
# - always compensate summation of high-order terms
# - error terms are added naively
# Stage 1: s_1 += s_2 + s_3 + s_4
s_1, ei = two_sum(s_1, s_3)
e_1 += ei
s_2, ei = two_sum(s_2, s_4)
e_2 += ei
s_1, ei = two_sum(s_1, s_2)
e_3 += ei
# Stage 2: s = sum(s_1)
s = zero(T)
e = zero(T)
for i in 1:W
@inbounds xi = s_1[i]
s, ei = two_sum(s, xi)
e += ei
end
e += sum((e_1+e_2) + (e_3+e_4))
# Remainder, unvectorized
while offset <= N
@inbounds xi = x[offset]
s, ei = two_sum(s, xi)
e += ei
offset += 1
end
s+e
end
using KahanSummation
relative_error(x, y) = (x - y) / y
function check_fun(x, f)
v = f(x)
v, relative_error(v, Float64(sum(Rational{BigInt}, x)))
end
x = 10^14 .* sin.((0.001:0.001:1) .* 2π);
check_fun(x, sum)
check_fun(x, sum_oro_scal)
check_fun(x, sum_oro_vec)
check_fun(x, sum_kbn)
x = 5000 |> N->randn(N) .* exp.(10 .* randn(N)) |> x->[x;-x;1.0] |> x->x[sortperm(rand(length(x)))];
check_fun(x, sum)
check_fun(x, sum_oro_scal)
check_fun(x, sum_oro_vec)
check_fun(x, sum_kbn)
using BenchmarkTools
@btime sum($x)
@btime sum_oro_scal($x)
@btime sum_oro_vec($x)
@btime sum_kbn($x)