Radford’s C code has now been posted under an MIT license: gitlab.com/radfordneal/xsum
Radford’s C code is designed for summing C doubles. We could create an interface to the C code, and that would work summing many Float64s accurately. Or, we could reimplement it in Julia an allow for summing T<:AbstractFloat. The maximum number of floats summable by the small superaccumulator will vary with the precision of the values being summed.
Also, in a Julia implementation the size of the accumulator could be a type parameter, making it possible to use less memory when the magnitude of the sum is known to be within a given range.
Having read Radford’s paper, I’m worried that the accumulator will completely fill the L1 cache and therefore be slower in practice than the benchmarks show…
His superaccumulator is 536 bytes for double precision (67 64-bit chunks). L1 cache sizes are typically ~32kB.
The paper discusses two different superaccumulators, where the “fast” one is 4096 x 64 bit (16 kB).
It also argues that most of these entries are not used for typical data with limited range. In any case, it claims at most a factor of two benefit from the “fast” method vs. the “small” method; I suspect that most cases requiring an exactly rounded sum are not performance-critical enough for this to matter. It would be nice to have an implementation of at least the “small” method in a Julia package for people who need exactly rounded floating-point sums.
This is exactly what I was referring to. It’d be easy to write an implementation of the “fast” method with user-selectable range and memory footprint.