# Speed issue with KahanSummation

I am using Julia version 1.7.1. I need to do summations of long series (repeatedly) where accuracy is important. I am looking at KahanSummation’ `sum_kbn` (pure Julia), but am puzzled by the very slow speed (with a lot of allocations). It is about 248 times slower than the regular `sum`, 230 times slower than the `sum_kbn` from AccurateArithmetic (C wrapper), and 62 times slower than my naive implementation of the same algorithm (`naive_kbn`). The `naive_kbn`, in turn, is about 3.67 times slower than AccurateArithmetic’s `sum_kbn`.

I also tried using `naive_kbn` with Threads.@spawn (thanks to the suggestion of @tkf from another thread), but it seems that the parallel introduces (rounding) errors making the summation not as accurate.

So my question: Is it possible to improve KahanSummation’s `sum_kbn`, making it much faster than my naive implementation? Or any implementation in pure Julia that has high accuracy with good speed?

Background: I wish to use high-accuracy algorithms in pure Julia because I need it to work with ForwardDiff which requires all the codes in pure Julia. So AccurateArithmetic is out. I could use my naive implementation (`naive_kbn`), but I still find it slow. I tried to increase the speed using Threads.@spawn, but it seems to introduce rounding errors.

Here is the code to demonstrate the statements above.

``````using Random, BenchmarkTools, ThreadsX
using KahanSummation, AccurateArithmetic

function test_speed(algo)
Random.seed!(1244)
bb = 0.0
aa1 = rand(500_000)
aa2 = shuffle(-1.0*aa1)
aa  = vcat(aa1, aa2)  # should sum to 0.0
for i in 1:100
bb += algo(aa)
end
return bb
end

function naive_kbn(x)
s = 0.0
c = 0.0
s = x - c
for i in 2:length(x)
t = s + x[i]
abs(s) >= abs(x[i]) ? c -= ((s-t) + x[i]) : c -= ((x[i] - t) + s)
s = t
end
return s - c
end

function naive_kbn_parallel(xs)  # credit to @tkf from another thread
if length(xs) <= 125000
return naive_kbn(xs)
else
return naive_kbn_parallel(@view xs[begin+(end-begin+1)÷2+1:end]) + fetch(t)
end
end
``````

Here are the speed comparisons.

``````julia> @btime test_speed(sum)  # not sum to 0.0
48.526 ms (14 allocations: 19.07 MiB)
-2.9103830456733704e-9

julia> @btime test_speed(AccurateArithmetic.sum_kbn) # not pure Julia
52.519 ms (14 allocations: 19.07 MiB)
0.0

julia> @btime test_speed(KahanSummation.sum_kbn) # very slow, lots allocations
12.052 s (399948714 allocations: 8.96 GiB)
0.0

julia> @btime test_speed(naive_kbn)   # could it be faster?
192.889 ms (14 allocations: 19.07 MiB)
0.0

julia> @btime test_speed(naive_kbn_parallel)  # good speed but not sum to 0.0
63.096 ms (8885 allocations: 19.75 MiB)
2.9103830456733704e-9
``````

I can’t reproduce this on

``````Julia Version 1.8.0-DEV.1309
Commit 89f23325aa (2022-01-13 19:48 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.0 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = code
``````

seeing

``````  27.343 ms (14 allocations: 19.07 MiB)
39.657 ms (14 allocations: 19.07 MiB)
160.417 ms (14 allocations: 19.07 MiB)
159.576 ms (14 allocations: 19.07 MiB)
50.452 ms (8903 allocations: 19.75 MiB)
``````

Regarding parallel summation you could do something like

``````using Random, BenchmarkTools
using KahanSummation, AccurateArithmetic

function test_speed(algo)
Random.seed!(1244)
bb = 0.0
aa1 = rand(500_000)
aa2 = shuffle(-1.0*aa1)
aa  = vcat(aa1, aa2)  # should sum to 0.0
for i in 1:100
bb += algo(aa)
end
return bb
end

function _naive_kbn(xs)
s = xs
c = 0.0
for i in 2:length(xs)
t = s + xs[i]
c += abs(s) >= abs(xs[i]) ? ((s - t) + xs[i]) : ((xs[i] - t) + s)
s = t
end
s, c
end

function naive_kbn_serial(xs)
s, c = _naive_kbn(xs)
s + c
end

function naive_kbn_parallel(xs)  # credit to @tkf from another thread
ys = Vector{Float64}(undef, 2 * nt)
len = length(xs)
chunk = len ÷ nt
s, c = _naive_kbn(@view xs[(i - 1) * chunk + 1:min(i * chunk, len)])
ys[2 * i - 1] = s
ys[2 * i] = c
end
naive_kbn_serial(ys)
end

println(test_speed(sum))
println(test_speed(AccurateArithmetic.sum_kbn))
println(test_speed(KahanSummation.sum_kbn))
println(test_speed(naive_kbn_serial))
println(test_speed(naive_kbn_parallel))

@btime test_speed(sum)
@btime test_speed(AccurateArithmetic.sum_kbn)
@btime test_speed(KahanSummation.sum_kbn)
@btime test_speed(naive_kbn_serial)
@btime test_speed(naive_kbn_parallel)
``````

yielding for me

``````-2.9103830456733704e-9
0.0
0.0
0.0
0.0
26.577 ms (14 allocations: 19.07 MiB)
33.483 ms (14 allocations: 19.07 MiB)
137.400 ms (14 allocations: 19.07 MiB)
139.154 ms (14 allocations: 19.07 MiB)
40.918 ms (2716 allocations: 19.39 MiB)
``````

Interesting. I tried it on another machine, also with Julia 1.7.1, and got the same result (i.e., KahanSummation very slow with a lot of allocations). I created a temporary environment with only the needed packages, same results.

However, if I tried it on another machine that has Julia 1.6.2, I got results similar to yours. So, it seems there are some compatibility issues in KahanSummation.jl with Julia 1.7.1.

It looks promising! It’s an acceptable balance between speed and accuracy. Many thanks.

Filed an issue.

The code `naive_kbn_parallel()` works great but has a small glitch. It omits the last m points in the data series xs where m=mod(length(xs), nt) and nt is the number of threads. I made a quick fix in case this thread was referenced in the future.

``````function naive_kbn_parallel(xs)
len = length(xs)

if len >= nt
ys = Vector{Real}(undef, 2 * nt )
chunk = len ÷ nt
s, c = _naive_kbn(@view xs[(i - 1) * chunk + 1 : min(i * chunk, len)])
ys[2 * i - 1] = s
ys[2 * i] = c
end

if mod(len, nt) > 0  #  small (< nt), `sum` is good enough
resid = sum(@view xs[nt*chunk + 1 : end])
else
resid = 0.0
end

return naive_kbn_serial(ys) + resid
else
return sum(xs)
end
end
``````
1 Like

AccurateArithmetic is implemented in Julia using `llvmcall`, not C. If your definition of “pure Julia” allows for wrapping LLVM’s SIMD intrinsics with `llvmcall`, then it is “pure Julia”.

We discuss its implementation here.

Our approach for vector reductions can also be applied to reducing parallel threads.

3 Likes

a) `VectorizationBase.Vec` should be ForwardDiff-compatible, but the support for that isn’t great beyond a few examples I needed. Feel free to file an issue.
b) Seems like there is almost certainly a better way to get gradients with respect to a sum. E.g., overload (type piracy…) to provide a specialized implementation that takes `ForwardDiff.Dual` as inputs.

3 Likes

AccurateArithmetic is implemented in Julia using `llvmcall` , not C. If your definition of “pure Julia” allows for wrapping LLVM’s SIMD intrinsics with `llvmcall` , then it is “pure Julia”.

Thank you for the clarification. By “pure Julia” I meant in the sense of ForwardDiff.jl’s “generic Julia” explained [in the doc] (Limitations of ForwardDiff · ForwardDiff). I tried using AccurateArithmetic’s `sum_oro()` and `sum_kbn()` in a maximum likelihood estimation using Optim.jl with the `autodiff = :forward` option, and it didn’t work. I am not sure whether it is because `sum_oro()` and `sum_kbn()` are not “generic Julia” to ForwardDiff. I started another thread to discuss the issue there, where I provided a MWE.

b) Seems like there is almost certainly a better way to get gradients with respect to a sum. E.g., overload (type piracy…) to provide a specialized implementation that takes `ForwardDiff.Dual` as inputs.

I am afraid I have no clue here. The MWE in the new thread mentioned above is very simple, but my real work involves high dimension Monte Carlo integrations which make the whole thing more complicated, I think. If there is a straightforward approach, I’ll certainly try.

@goerch PM’d me another solution of doing multi-threading Kahan summation based on KahanSummation.jl. It looks much more elegant and I don’t think @goerch would mind me sharing it here.

(Note that AccurateArithmetic.jl and KahanSummation.jl both export `sum_kbn()`. This post is about the function from KahanSummation.)

It is based on a patch of KahanSummation here. The patch has an outdated function `mapreduce_single` which has to be replaced by `mapreduce_first` for recent versions of Julia. After the update, the content could be loaded using `include("..")`. I made the package readily available on this page; search for the file name `KahanSummation_patch.jl`.

After loading, @goerch showed that it could be extended to construct a multi-threading (or, parallel) function as follows.

``````using InitialValues, Folds

include("KahanSummation_patch.jl")
psum_kbn(f, X) = singleprec(Folds.mapreduce(f, InitialValues.asmonoid(plus_kbn), X))
psum_kbn(X) = psum_kbn(identity, X)
``````

The use of `psum_kbn()` is just the same as `sum_kbn()` and it is much faster.

So, there are currently a couple of high precision summation functions in Julia. In terms of speed, the rank is (slow to fast):

`KahanSummation.sum_kbn()`: < `psum_kbn()` < `AccurateArithmetic.sum_kbn()` = `AccurateArithmetic.sum_oro()`

I believe they are very useful for many applications. However, only `KahanSummation.sum_kbn()`, the slowest variant, is compatible with the AD package ForwardDiff.jl. I am starting another thread to discuss the issue.

Thanks for helping out. I had a short look into the error of this MWE

``````ERROR: MethodError: no method matching accumulate(::Tuple{Vector{ForwardDiff.Dual{ForwardDiff.Tag{var"#17#18", Float64}, Float64, 2}}}, ::AccurateArithmetic.Summation.var"#1#2"{typeof(AccurateArithmetic.EFT.fast_two_sum)}, ::Val{:scalar}, ::Val{2}, ::Val{0})
``````

and noticed `accumulate` restricts the type to reals AFAIU

``````@generated function accumulate(x::NTuple{A, AbstractArray{T}},
accType::F,
rem_handling = Val(:scalar),
::Val{Ushift} = Val(2),
::Val{Prefetch} = Val(0),
)  where {F, A, T <: Union{Float32,Float64}, Ushift, Prefetch}
``````

Is it possible and does it make sense to generalize this?