Speed issue with KahanSummation

I am using Julia version 1.7.1. I need to do summations of long series (repeatedly) where accuracy is important. I am looking at KahanSummation’s sum_kbn (pure Julia), but am puzzled by its very slow speed and large number of allocations. It is about 248 times slower than the regular sum, 230 times slower than the sum_kbn from AccurateArithmetic (C wrapper), and 62 times slower than my naive implementation of the same algorithm (naive_kbn). The naive_kbn, in turn, is about 3.67 times slower than AccurateArithmetic’s sum_kbn.

I also tried using naive_kbn with Threads.@spawn (thanks to the suggestion of @tkf from another thread), but it seems the parallel version introduces rounding errors, making the summation less accurate (presumably because splitting the sum changes the order of operations and each chunk’s result is rounded before the chunks are combined).

So my question: is it possible to improve KahanSummation’s sum_kbn, making it much faster than my naive implementation? Or is there any pure-Julia implementation that has high accuracy with good speed?

Background: I wish to use high-accuracy algorithms in pure Julia because I need them to work with ForwardDiff, which requires all the code to be pure Julia. So AccurateArithmetic is out. I could use my naive implementation (naive_kbn), but I still find it slow. I tried to increase the speed using Threads.@spawn, but it seems to introduce rounding errors.
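
For context on why reordering alone can change the result: floating-point addition is not associative, so splitting a sum across threads changes the rounding. A one-line illustration:

(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)  # false: 0.6000000000000001 vs 0.6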

Here is the code to demonstrate the statements above.

using Random, BenchmarkTools, ThreadsX
using KahanSummation, AccurateArithmetic

function test_speed(algo)
    Random.seed!(1244)
    bb = 0.0
    aa1 = rand(500_000)
    aa2 = shuffle(-1.0*aa1)        
    aa  = vcat(aa1, aa2)  # should sum to 0.0
    for i in 1:100
        bb += algo(aa)
    end
    return bb
end

function naive_kbn(x)
    s = x[1]   # running sum
    c = 0.0    # running compensation (accumulated low-order bits)
    for i in 2:length(x)
        t = s + x[i]
        # recover the rounding error of s + x[i] and fold it into c
        c += abs(s) >= abs(x[i]) ? ((s - t) + x[i]) : ((x[i] - t) + s)
        s = t
    end
    return s + c
end

function naive_kbn_parallel(xs)  # credit to @tkf from another thread
    if length(xs) <= 125_000
        return naive_kbn(xs)
    else
        # split in half: sum the first half on another task, recurse on the second
        t = Threads.@spawn naive_kbn_parallel(@view xs[begin:begin+(end-begin+1)÷2])
        return naive_kbn_parallel(@view xs[begin+(end-begin+1)÷2+1:end]) + fetch(t)
    end
end
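
As a quick sanity check on the algorithm itself (my own toy example, not part of the benchmark below), naive_kbn recovers the classic case where a plain left-to-right sum loses everything:

xs = [1.0, 1e100, 1.0, -1e100]  # exact sum is 2.0
sum(xs)        # 0.0 -- both 1.0s are absorbed by 1e100 and lost
naive_kbn(xs)  # 2.0 -- the compensation term recovers them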

Here are the speed comparisons.

julia> @btime test_speed(sum)  # does not sum to 0.0
  48.526 ms (14 allocations: 19.07 MiB)
-2.9103830456733704e-9

julia> @btime test_speed(AccurateArithmetic.sum_kbn) # not pure Julia
  52.519 ms (14 allocations: 19.07 MiB)
0.0

julia> @btime test_speed(KahanSummation.sum_kbn) # very slow, lots of allocations
  12.052 s (399948714 allocations: 8.96 GiB)
0.0

julia> @btime test_speed(naive_kbn)   # could it be faster?
  192.889 ms (14 allocations: 19.07 MiB)
0.0

julia> @btime test_speed(naive_kbn_parallel)  # good speed, but does not sum to 0.0
  63.096 ms (8885 allocations: 19.75 MiB)
2.9103830456733704e-9

I can’t reproduce this on

Julia Version 1.8.0-DEV.1309
Commit 89f23325aa (2022-01-13 19:48 UTC)        
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 5

seeing

  27.343 ms (14 allocations: 19.07 MiB)
  39.657 ms (14 allocations: 19.07 MiB)
  160.417 ms (14 allocations: 19.07 MiB)
  159.576 ms (14 allocations: 19.07 MiB)
  50.452 ms (8903 allocations: 19.75 MiB)

Regarding parallel summation, you could do something like

using Random, BenchmarkTools
using KahanSummation, AccurateArithmetic

function test_speed(algo)
    Random.seed!(1244)
    bb = 0.0
    aa1 = rand(500_000)
    aa2 = shuffle(-1.0*aa1)        
    aa  = vcat(aa1, aa2)  # should sum to 0.0
    for i in 1:100
        bb += algo(aa)
    end
    return bb
end

function _naive_kbn(xs)
    s = xs[1]
    c = 0.0
    for i in 2:length(xs)
        t = s + xs[i]
        c += abs(s) >= abs(xs[i]) ? ((s - t) + xs[i]) : ((xs[i] - t) + s)
        s = t
    end
    s, c
end

function naive_kbn_serial(xs)
    s, c = _naive_kbn(xs)
    s + c
end

function naive_kbn_parallel(xs)  # credit to @tkf from another thread
    nt = Threads.nthreads()
    ys = Vector{Float64}(undef, 2 * nt)
    len = length(xs)
    chunk = len ÷ nt
    Threads.@threads for i in 1:nt
        s, c = _naive_kbn(@view xs[(i - 1) * chunk + 1:min(i * chunk, len)])
        ys[2 * i - 1] = s
        ys[2 * i] = c
    end
    naive_kbn_serial(ys)
end

println(test_speed(sum))
println(test_speed(AccurateArithmetic.sum_kbn))
println(test_speed(KahanSummation.sum_kbn))
println(test_speed(naive_kbn_serial))
println(test_speed(naive_kbn_parallel))

@btime test_speed(sum)
@btime test_speed(AccurateArithmetic.sum_kbn)
@btime test_speed(KahanSummation.sum_kbn) 
@btime test_speed(naive_kbn_serial)   
@btime test_speed(naive_kbn_parallel)  

yielding for me

-2.9103830456733704e-9
0.0
0.0
0.0
0.0
  26.577 ms (14 allocations: 19.07 MiB)
  33.483 ms (14 allocations: 19.07 MiB)
  137.400 ms (14 allocations: 19.07 MiB)
  139.154 ms (14 allocations: 19.07 MiB)
  40.918 ms (2716 allocations: 19.39 MiB)

Interesting. I tried it on another machine, also with Julia 1.7.1, and got the same result (i.e., KahanSummation very slow, with a lot of allocations). I also created a temporary environment with only the needed packages; same results.

However, when I tried it on another machine that has Julia 1.6.2, I got results similar to yours. So it seems there is some compatibility issue between KahanSummation.jl and Julia 1.7.1.

It looks promising! It’s an acceptable balance between speed and accuracy. Many thanks.

Filed an issue.

The code naive_kbn_parallel() works great but has a small glitch: it omits the last m points of the data series xs, where m = mod(length(xs), nt) and nt is the number of threads. I made a quick fix in case this thread is referenced in the future.

function naive_kbn_parallel(xs)  
    nt  = Threads.nthreads()
    len = length(xs)

    if len >= nt 
        ys = Vector{Float64}(undef, 2 * nt)  # concrete eltype; Vector{Real} would box every element
        chunk = len ÷ nt
        Threads.@threads for i in 1:nt
            s, c = _naive_kbn(@view xs[(i - 1) * chunk + 1 : min(i * chunk, len)])
            ys[2 * i - 1] = s
            ys[2 * i] = c
        end

        if mod(len, nt) > 0  # remainder is small (< nt), plain `sum` is good enough
            resid = sum(@view xs[nt*chunk + 1 : end])
        else 
            resid = 0.0
        end

        return naive_kbn_serial(ys) + resid
    else
        return sum(xs)    
    end
end
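
A quick check of the fix (my own, assuming Julia was started with more than one thread), using a slow BigFloat sum as a (nearly) exact reference and a length that is deliberately not divisible by typical thread counts:

Random.seed!(1)
xs  = rand(1_000_003)              # mod(length(xs), nt) > 0 for nt = 2..6
ref = Float64(sum(big.(xs)))       # slow but (nearly) exact reference
abs(naive_kbn_parallel(xs) - ref)  # should be 0.0, or at worst a few ulps
abs(naive_kbn_serial(xs) - ref)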

AccurateArithmetic is implemented in Julia using llvmcall, not C. If your definition of “pure Julia” allows for wrapping LLVM’s SIMD intrinsics with llvmcall, then it is “pure Julia”.

We discuss its implementation here.

Our approach for vector reductions can also be applied to reducing parallel threads.
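
To illustrate (a minimal sketch of the idea, not the actual AccurateArithmetic code): two compensated partial sums (s1, c1) and (s2, c2) can be merged with the same error-free trick used inside the per-element loop, so the compensation terms survive the reduction instead of each chunk being rounded down to a single Float64 first.

# Merge two compensated partial sums into one (s, c) pair.
# The branch is the same Neumaier comparison as in the scalar loop.
function merge_kbn((s1, c1), (s2, c2))
    t = s1 + s2
    e = abs(s1) >= abs(s2) ? (s1 - t) + s2 : (s2 - t) + s1
    (t, e + c1 + c2)
end

Reducing the per-thread pairs with merge_kbn and only collapsing to s + c at the very end avoids rounding away each chunk’s error term prematurely.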


a) VectorizationBase.Vec should be ForwardDiff-compatible, but the support for that isn’t great beyond a few examples I needed. Feel free to file an issue.
b) Seems like there is almost certainly a better way to get gradients with respect to a sum, e.g., overloading (type piracy…) to provide a specialized implementation that takes ForwardDiff.Dual as inputs; see the sketch below.
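
A minimal sketch of what b) could look like (hypothetical helper name dual_sum_kbn, untested): a sum of Duals is the Dual of the sums, so the value stream and each partials stream can each be fed to the fast Float64 kernel.

using ForwardDiff: Dual, value, partials
using AccurateArithmetic

# Sum the values and each partial component with the fast kernel,
# then reassemble a Dual. A sketch, not a tested implementation.
function dual_sum_kbn(xs::AbstractArray{Dual{T,V,N}}) where {T,V,N}
    s  = AccurateArithmetic.sum_kbn(value.(xs))
    ps = ntuple(i -> AccurateArithmetic.sum_kbn([partials(x, i) for x in xs]), N)
    Dual{T}(s, ps...)
end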


AccurateArithmetic is implemented in Julia using llvmcall, not C. If your definition of “pure Julia” allows for wrapping LLVM’s SIMD intrinsics with llvmcall, then it is “pure Julia”.

Thank you for the clarification. By “pure Julia” I meant it in the sense of ForwardDiff.jl’s “generic Julia”, explained in the doc (Limitations of ForwardDiff · ForwardDiff). I tried using AccurateArithmetic’s sum_oro() and sum_kbn() in a maximum likelihood estimation using Optim.jl with the autodiff = :forward option, and it didn’t work. I am not sure whether that is because sum_oro() and sum_kbn() are not “generic Julia” from ForwardDiff’s point of view. I started another thread to discuss the issue, where I provided an MWE.

b) Seems like there is almost certainly a better way to get gradients with respect to a sum, e.g., overloading (type piracy…) to provide a specialized implementation that takes ForwardDiff.Dual as inputs.

I am afraid I have no clue here. The MWE in the new thread mentioned above is very simple, but my real work involves high-dimensional Monte Carlo integration, which makes the whole thing more complicated, I think. If there is a straightforward approach, I’ll certainly try it.

@goerch PM’d me another solution for multi-threaded Kahan summation, based on KahanSummation.jl. It looks much more elegant, and I don’t think @goerch would mind me sharing it here.

(Note that AccurateArithmetic.jl and KahanSummation.jl both export sum_kbn(). This post is about the function from KahanSummation.)

It is based on a patch of KahanSummation here. The patch uses the outdated function mapreduce_single, which has to be replaced by mapreduce_first on recent versions of Julia. After that update, the file can be loaded with include(".."). I made the patched file available on this page; search for the file name KahanSummation_patch.jl.

After loading it, @goerch showed that it can be extended into a multi-threaded (parallel) function as follows.

using InitialValues, Folds

include("KahanSummation_patch.jl")
psum_kbn(f, X) = singleprec(Folds.mapreduce(f, InitialValues.asmonoid(plus_kbn), X))
psum_kbn(X) = psum_kbn(identity, X)

psum_kbn() is used just like sum_kbn(), and it is much faster.
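
For example (assuming KahanSummation_patch.jl sits in the working directory):

xs = rand(10^6)
psum_kbn(xs)            # drop-in replacement for KahanSummation.sum_kbn(xs)
psum_kbn(x -> x^2, xs)  # compensated sum of x^2 over xs, like sum(abs2, xs)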

So there are currently a couple of high-precision summation functions in Julia. In terms of speed, the ranking is (slow to fast):

KahanSummation.sum_kbn() < psum_kbn() < AccurateArithmetic.sum_kbn() = AccurateArithmetic.sum_oro()

I believe they are very useful for many applications. However, only KahanSummation.sum_kbn(), the slowest variant, is compatible with the AD package ForwardDiff.jl. I am starting another thread to discuss the issue.

Thanks for helping out. I had a short look at the error from this MWE

ERROR: MethodError: no method matching accumulate(::Tuple{Vector{ForwardDiff.Dual{ForwardDiff.Tag{var"#17#18", Float64}, Float64, 2}}}, ::AccurateArithmetic.Summation.var"#1#2"{typeof(AccurateArithmetic.EFT.fast_two_sum)}, ::Val{:scalar}, ::Val{2}, ::Val{0})

and noticed that accumulate restricts the element type to Float32/Float64, AFAIU:

@generated function accumulate(x::NTuple{A, AbstractArray{T}},
                               accType::F,
                               rem_handling = Val(:scalar),
                               ::Val{Ushift} = Val(2),
                               ::Val{Prefetch} = Val(0),
                               )  where {F, A, T <: Union{Float32,Float64}, Ushift, Prefetch}

Is it possible and does it make sense to generalize this?

Added some more information to the issue and made a PR.
