I am using Julia version 1.7.1. I need to do summations of long series (repeatedly) where accuracy is important. I am looking at KahanSummation's sum_kbn (pure Julia), but am puzzled by its very slow speed (with a lot of allocations). It is about 248 times slower than the regular sum, 230 times slower than the sum_kbn from AccurateArithmetic (C wrapper), and 62 times slower than my naive implementation of the same algorithm (naive_kbn). The naive_kbn, in turn, is about 3.67 times slower than AccurateArithmetic's sum_kbn.
I also tried using naive_kbn with Threads.@spawn (thanks to a suggestion from @tkf in another thread), but the parallel version seems to introduce (rounding) errors, making the summation less accurate.
So my question: is it possible to improve KahanSummation's sum_kbn to make it much faster than my naive implementation? Or is there any pure-Julia implementation with high accuracy and good speed?
Background: I need high-accuracy algorithms in pure Julia because they must work with ForwardDiff, which requires all the code to be written in pure Julia. So AccurateArithmetic is out. I could use my naive implementation (naive_kbn), but I still find it slow. I tried to speed it up with Threads.@spawn, but that seems to introduce rounding errors.
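For context on the parallel discrepancy: floating-point addition is not associative, so splitting the sum across threads changes the order of operations and hence the rounded result. A one-line demonstration in the REPL:

julia> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
false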
Here is the code to demonstrate the statements above.
using Random, BenchmarkTools, ThreadsX
using KahanSummation, AccurateArithmetic

function test_speed(algo)
    Random.seed!(1244)
    bb = 0.0
    aa1 = rand(500_000)
    aa2 = shuffle(-1.0 * aa1)
    aa = vcat(aa1, aa2)  # should sum to 0.0
    for i in 1:100
        bb += algo(aa)
    end
    return bb
end
function naive_kbn(x)
    s = x[1]
    c = 0.0  # running compensation for lost low-order bits
    for i in 2:length(x)
        t = s + x[i]
        # KBN step: recover the rounding error of s + x[i] from whichever
        # operand is larger in magnitude
        if abs(s) >= abs(x[i])
            c += (s - t) + x[i]
        else
            c += (x[i] - t) + s
        end
        s = t
    end
    return s + c
end
function naive_kbn_parallel(xs)  # credit to @tkf from another thread
    if length(xs) <= 125_000
        return naive_kbn(xs)
    else
        t = Threads.@spawn naive_kbn_parallel(@view xs[begin:begin + (end - begin + 1) ÷ 2])
        return naive_kbn_parallel(@view xs[begin + (end - begin + 1) ÷ 2 + 1:end]) + fetch(t)
    end
end
Here are the speed comparisons.
julia> @btime test_speed(sum) # does not sum to 0.0
48.526 ms (14 allocations: 19.07 MiB)
-2.9103830456733704e-9
julia> @btime test_speed(AccurateArithmetic.sum_kbn) # not pure Julia
52.519 ms (14 allocations: 19.07 MiB)
0.0
julia> @btime test_speed(KahanSummation.sum_kbn) # very slow, lots of allocations
12.052 s (399948714 allocations: 8.96 GiB)
0.0
julia> @btime test_speed(naive_kbn) # could it be faster?
192.889 ms (14 allocations: 19.07 MiB)
0.0
julia> @btime test_speed(naive_kbn_parallel) # good speed but does not sum to 0.0
63.096 ms (8885 allocations: 19.75 MiB)
2.9103830456733704e-9
Regarding parallel summation, you could do something like this:
using Random, BenchmarkTools
using KahanSummation, AccurateArithmetic

function test_speed(algo)
    Random.seed!(1244)
    bb = 0.0
    aa1 = rand(500_000)
    aa2 = shuffle(-1.0 * aa1)
    aa = vcat(aa1, aa2)  # should sum to 0.0
    for i in 1:100
        bb += algo(aa)
    end
    return bb
end

function _naive_kbn(xs)
    s = xs[1]
    c = 0.0
    for i in 2:length(xs)
        t = s + xs[i]
        c += abs(s) >= abs(xs[i]) ? ((s - t) + xs[i]) : ((xs[i] - t) + s)
        s = t
    end
    s, c
end

function naive_kbn_serial(xs)
    s, c = _naive_kbn(xs)
    s + c
end

function naive_kbn_parallel(xs)  # credit to @tkf from another thread
    nt = Threads.nthreads()
    ys = Vector{Float64}(undef, 2 * nt)
    len = length(xs)
    chunk = len ÷ nt
    Threads.@threads for i in 1:nt
        s, c = _naive_kbn(@view xs[(i - 1) * chunk + 1:min(i * chunk, len)])
        ys[2 * i - 1] = s
        ys[2 * i] = c
    end
    naive_kbn_serial(ys)
end
println(test_speed(sum))
println(test_speed(AccurateArithmetic.sum_kbn))
println(test_speed(KahanSummation.sum_kbn))
println(test_speed(naive_kbn_serial))
println(test_speed(naive_kbn_parallel))
@btime test_speed(sum)
@btime test_speed(AccurateArithmetic.sum_kbn)
@btime test_speed(KahanSummation.sum_kbn)
@btime test_speed(naive_kbn_serial)
@btime test_speed(naive_kbn_parallel)
yielding for me
-2.9103830456733704e-9
0.0
0.0
0.0
0.0
26.577 ms (14 allocations: 19.07 MiB)
33.483 ms (14 allocations: 19.07 MiB)
137.400 ms (14 allocations: 19.07 MiB)
139.154 ms (14 allocations: 19.07 MiB)
40.918 ms (2716 allocations: 19.39 MiB)
Interesting. I tried it on another machine, also with Julia 1.7.1, and got the same result (i.e., KahanSummation very slow with a lot of allocations). I also created a temporary environment with only the needed packages; same results.
However, when I tried it on another machine running Julia 1.6.2, I got results similar to yours. So it seems there is some compatibility issue between KahanSummation.jl and Julia 1.7.1.
The code naive_kbn_parallel() works great but has a small glitch: it omits the last m points of the data series xs, where m = mod(length(xs), nt) and nt is the number of threads. I made a quick fix in case this thread is referenced in the future.
function naive_kbn_parallel(xs)
    nt = Threads.nthreads()
    len = length(xs)
    if len >= nt
        ys = Vector{Float64}(undef, 2 * nt)  # concrete element type (Vector{Real} would box)
        chunk = len ÷ nt
        Threads.@threads for i in 1:nt
            s, c = _naive_kbn(@view xs[(i - 1) * chunk + 1:min(i * chunk, len)])
            ys[2 * i - 1] = s
            ys[2 * i] = c
        end
        if mod(len, nt) > 0  # remainder is small (< nt), so `sum` is good enough
            resid = sum(@view xs[nt * chunk + 1:end])
        else
            resid = 0.0
        end
        return naive_kbn_serial(ys) + resid
    else
        return sum(xs)
    end
end
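As an illustrative sanity check (not from the original post), the remainder handling can be exercised with an array length that is not divisible by a typical thread count; the two variants should now agree:

julia> xs = rand(1_000_003);

julia> naive_kbn_parallel(xs) ≈ naive_kbn_serial(xs)
true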
AccurateArithmetic is implemented in Julia using llvmcall, not C. If your definition of “pure Julia” allows for wrapping LLVM’s SIMD intrinsics with llvmcall, then it is “pure Julia”.
a) VectorizationBase.Vec should be ForwardDiff-compatible, but the support for that isn’t great beyond a few examples I needed. Feel free to file an issue.
b) Seems like there is almost certainly a better way to get gradients with respect to a sum. E.g., overload it (type piracy…) to provide a specialized implementation that takes ForwardDiff.Dual as inputs; see the sketch after this list.
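To make point (b) concrete, here is a minimal sketch of such a specialized method (my own illustration, not code from this thread). It assumes the dual numbers carry plain Float64 values and partials, reuses naive_kbn_serial from above, and sum_kbn_dual is a hypothetical name:

using ForwardDiff: Dual, value, partials

# Since d(sum)/dx_i == 1, the derivative of a sum is just the sum of the
# partials; run a compensated sum over the values and over each partial slot.
function sum_kbn_dual(xs::AbstractVector{Dual{T,V,N}}) where {T,V,N}
    sval = naive_kbn_serial(value.(xs))
    spart = ntuple(j -> naive_kbn_serial(partials.(xs, j)), Val(N))
    return Dual{T}(sval, spart...)
end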
AccurateArithmetic is implemented in Julia using llvmcall, not C. If your definition of “pure Julia” allows for wrapping LLVM’s SIMD intrinsics with llvmcall, then it is “pure Julia”.
Thank you for the clarification. By “pure Julia” I meant it in the sense of ForwardDiff.jl’s “generic Julia”, explained in the docs (Limitations of ForwardDiff · ForwardDiff). I tried using AccurateArithmetic’s sum_oro() and sum_kbn() in a maximum likelihood estimation using Optim.jl with the autodiff = :forward option, and it didn’t work. I am not sure whether that is because sum_oro() and sum_kbn() are not “generic Julia” as ForwardDiff requires. I started another thread to discuss the issue, where I provided an MWE.
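For a flavor of the failure mode (an illustrative guess, assuming the package’s methods only accept arrays of plain floats): with autodiff = :forward, the objective is evaluated with ForwardDiff.Dual elements, so a call like

using ForwardDiff, AccurateArithmetic

ForwardDiff.gradient(x -> AccurateArithmetic.sum_kbn(x), rand(4))

would fail with a MethodError if no sum_kbn method accepts Dual element types.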
b) Seems like there is almost certainly a better way to get gradients with respect to a sum. E.g., overload (type piracy…) to provide a specialized implementation that takes ForwardDiff.Dual as inputs.
I am afraid I have no clue here. The MWE in the new thread mentioned above is very simple, but my real work involves high-dimensional Monte Carlo integrations, which make the whole thing more complicated, I think. If there is a straightforward approach, I’ll certainly try it.
@goerch PM’d me another solution for multi-threaded Kahan summation based on KahanSummation.jl. It looks much more elegant, and I don’t think @goerch would mind me sharing it here.
(Note that AccurateArithmetic.jl and KahanSummation.jl both export sum_kbn(). This post is about the function from KahanSummation.)
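Concretely, with both packages loaded an unqualified sum_kbn is ambiguous and Julia will refuse to resolve it, so calls need to be module-qualified, e.g.:

using KahanSummation, AccurateArithmetic

xs = rand(10)
KahanSummation.sum_kbn(xs)      # the pure-Julia variant discussed here
AccurateArithmetic.sum_kbn(xs)  # the llvmcall-based variant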
It is based on a patch of KahanSummation here. The patch uses the outdated function mapreduce_single, which has to be replaced by mapreduce_first for recent versions of Julia. After that update, the content can be loaded with include(".."). I made the patched file readily available on this page; search for the file name KahanSummation_patch.jl.
After loading it, @goerch showed that it can be extended to a multi-threaded (parallel) function along the following lines.
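Since the exact code is not reproduced here, the following is my reconstruction of the idea rather than @goerch’s original: split the input into per-thread chunks, run the patched sum_kbn on each chunk, and combine the per-chunk sums with one more sum_kbn pass (sum_kbn is assumed to be the patched KahanSummation version loaded via include above):

function sum_kbn_threaded(xs)
    nt = Threads.nthreads()
    len = length(xs)
    len >= nt || return sum_kbn(xs)
    chunk = cld(len, nt)  # ceiling division so no trailing elements are dropped
    partial = Vector{Float64}(undef, nt)
    Threads.@threads for i in 1:nt
        lo = (i - 1) * chunk + 1
        hi = min(i * chunk, len)
        partial[i] = lo <= hi ? sum_kbn(@view xs[lo:hi]) : 0.0
    end
    return sum_kbn(partial)
end

Note that combining only the per-chunk sums (without their compensation terms) gives up a little accuracy at the chunk boundaries; keeping (s, c) pairs as in _naive_kbn above would preserve it.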
I believe these are very useful for many applications. However, only KahanSummation.sum_kbn(), the slowest variant, is compatible with the AD package ForwardDiff.jl. I am starting another thread to discuss that issue.