Not my idea, I’ve only described the technique as told here - as far as I understand, any derivation of GPL’d code has to be GPL’d itself. By describing the algorithm with various sources without directly translating it to make it as easy as possible to reimplement it, it’s possible to (somewhat) circumvent this.

Thanks for the feedback everyone — looks like I have plenty of alternatives to choose from at the moment.

The default sum function in Julia is also quite accurate because it uses pairwise summation

I didn’t realize the regular sum function uses pairwise summation, this will probably be adequate for my present requirements. I need to hand in a PhD thesis in the 2 months time so it sounds like someone else will get to implementing Shewchuck’s (or Neal’s) algorithm before me. If not I will look at it again at that time.

EDIT : Out of curiosity, would this fsum equivalent belong in the stdlib or somewhere else?

Please do not do that. Translating GPL code to pseudocode that “looks suspiciously like Python” and then having someone else translate that pseudocode into actual Python (or in this case Julia) is far more legally dicey than just reading the paper and implementing the algorithm based on the description while disregarding the GPL code that’s included. Translating twice does not break the “derived work link”—you can translate through as many different programming languages as you want and it’s still a derived work.

(Note that the published code by Radford is not GPL—in fact, it has no license statement at all, which means that you cannot redistribute it, or code derived from it, in any form. Clean-room re-implementation is not about “GPL avoidance,” it’s about working around copyright law’s definition of “derived work.” In this case, the paper itself contains code excerpts, so it is especially tricky to re-implement the algorithm in a clean way from a copyright perspective.)

The author emailed me to say he was planning on posting a cleaned-up version of the code under a free/open-source license (his initial impulse was GPL, but I’ve urged him to consider a more permissive license for a small library routine like this). Not sure about the timeframe.

A Kahan / Pairwise polyalgorithm seems like it could still be pretty fast, it might be +15*3 operations for summing 256 numbers depending on where you switch. Not sure how much extra precision you’d get

Pairwise summation is the default in Julia; it is just as fast as naive summation and much more accurate. Kahan summation is even more accurate but significantly slower, which is why it is not the default.

It’s a pretty neat trick too, and definitely a hard default to argue against. I was just noticing that each pass of the pairwise sum makes a partial Kahan sum on the remaining values 50% cheaper

Another approach

function finiteDif_sum(arr, d= 0.0)
x = 0.0; y=0.0;
y += d;
for i in arr
x = x + x-y + d + i; ## +y-x -d ?
y = y + i + y + i - x - d; ## and + x - y - i + d ?
end;
[x,y - d]
end;

Radford’s C code is designed for summing C doubles. We could create an interface to the C code, and that would work summing many Float64s accurately. Or, we could reimplement it in Julia an allow for summing T<:AbstractFloat. The maximum number of floats summable by the small superaccumulator will vary with the precision of the values being summed.

Also, in a Julia implementation the size of the accumulator could be a type parameter, making it possible to use less memory when the magnitude of the sum is known to be within a given range.

Having read Radford’s paper, I’m worried that the accumulator will completely fill the L1 cache and therefore be slower in practice than the benchmarks show…

It also argues that most of these entries are not used for typical data with limited range. In any case, it claims at most a factor of two benefit from the “fast” method vs. the “small” method; I suspect that most cases requiring an exactly rounded sum are not performance-critical enough for this to matter. It would be nice to have an implementation of at least the “small” method in a Julia package for people who need exactly rounded floating-point sums.

I was too lazy to re-implement it from scratch, but I think this should be sufficient for most needs: it provides an xsum function that does exact double-precision (Float64 and ComplexF64) accumulation over arbitrary iterators (with an optimized method for contiguous arrays).

It seems to be quite fast — for an array of 10⁴ elements, it is actually about the same speed as a naive sum without SIMD. With @simd (which is enabled for the built-in sum) it is about 10x slower, since Neal’s code does not seem to exploit SIMD.

julia> function naivesum(a)
s = zero(eltype(a))
for x in a
s += x
end
return s
end
mysum (generic function with 1 method)
julia> a = randn(10^4);
julia> using Xsum, BenchmarkTools
julia> @btime sum($a);
1.107 μs (0 allocations: 0 bytes)
julia> @btime xsum($a);
12.272 μs (0 allocations: 0 bytes)
julia> @btime naivesum($a);
12.175 μs (0 allocations: 0 bytes)
julia> err(s, exact) = abs(s - exact) / abs(exact) # relative error
err (generic function with 1 method)
julia> exact = Float64(sum(big.(a))); err(sum(a), exact), err(xsum(a), exact), err(naivesum(a), exact)
(1.5635496798094023e-16, 0.0, 6.097843751256669e-15)

On my machine, Neal’s “large” accumulator starts being faster than the “small” accumulator around n=256.

The algorithm has a lot of data-dependent memory addressing, which would result in gather / scatter operations that are detrimental to SIMD performance. Also, consecutive inputs may or may not affect the same part of the accumulator, making it tricky to handle them in parallel.

Maybe one could develop a fast SIMD implementation that only works on a very limited range of exponents (a user-selected offset ± N) and send any numbers outside of this range to the slower algorithm. The entire accumulation buffer should then fit in a SIMD vector.