Julia equivalent of Python's "fsum" for floating point summation

This discussion brings me to a question: would there be some space and interest in the Julia ecosystem for a CompensatedAlgorithms.jl package?

It could replace and extend KahanSummation.jl by providing several algorithms (not just Kahan-Babuska-Neumaier, but also Ogita-Rump-Oishi, Priest and others). Such a package could also not be limited to summation and cumulated summation, but also provide accurate dot product implementations, or polynomial evaluations for example.

I think we already have all the necessary building blocks (e.g EFTs in ErrorfreeArithmetic.jl or StochasticArithmetic.jl; I haven’t yet taken the time to properly compare both implementations performancewise), and we’d only need to list and implement the algorithms themselves.

If there is some enthusiasm for such a thing, I’m very much willing to take part in its development.

9 Likes

I don’t think your problem comes from the final reduction, but rather from the vectorized summation itself.

After making many tests (most of which making no sense whatsoever), I have come to a strong feeling that LLVM is breaking some of the compensation during its optimization phase.

I need to document this, but you need versions with an e (for “explicitly”) in front: evadd, evsub, evmul instead of vadd, vsub, and vmul, etc.

using SIMDPirates, VectorizationBase
fused_nmul_sub(a, b, c) = SIMDPirates.vsub(SIMDPirates.vmul(SIMDPirates.vsub(a), b), c)
unfused_nmul_sub(a, b, c) = SIMDPirates.evsub(SIMDPirates.evmul(SIMDPirates.vsub(a), b), c)


randvec(::Type{T} = Float64) where {T} = ntuple(Val(VectorizationBase.pick_vector_width(T))) do w Core.VecElement(randn(T)) end

a = randvec(); b = randvec(); c = randvec();

This yields

julia> @code_native fused_nmul_sub(a, b, c)
	.text
; ┌ @ REPL[38]:1 within `fused_nmul_sub'
; │┌ @ SIMDPirates.jl:35 within `vsub'
; ││┌ @ floating_point_arithmetic.jl:58 within `macro expansion'
; │││┌ @ llvmwrap.jl:66 within `llvmwrap' @ llvmwrap.jl:66
; ││││┌ @ REPL[38]:1 within `macro expansion'
	vfnmsub213pd	%zmm2, %zmm1, %zmm0 # zmm0 = -(zmm1 * zmm0) - zmm2
; │└└└└
	retq
	nopw	(%rax,%rax)
; └
	.text
; ┌ @ REPL[39]:1 within `unfused_nmul_sub'
; │┌ @ SIMDPirates.jl:49 within `evmul'
; ││┌ @ floating_point_arithmetic.jl:65 within `macro expansion'
; │││┌ @ llvmwrap.jl:96 within `llvmwrap_notfast' @ llvmwrap.jl:96
; ││││┌ @ REPL[39]:1 within `macro expansion'
	vmulpd	%zmm1, %zmm0, %zmm0
	movabsq	$140629991519192, %rax  # imm = 0x7FE6F8B073D8
	vxorpd	(%rax){1to8}, %zmm0, %zmm0
; │└└└└
; │┌ @ SIMDPirates.jl:49 within `evsub'
; ││┌ @ floating_point_arithmetic.jl:65 within `macro expansion'
; │││┌ @ llvmwrap.jl:96 within `llvmwrap_notfast' @ llvmwrap.jl:96
; ││││┌ @ llvmwrap.jl:115 within `macro expansion'
	vsubpd	%zmm2, %zmm0, %zmm0
; │└└└└
	retq
	nopl	(%rax)
; └

That is 1 vs 4 instructions. Of course, here we don’t want the automatic fusion / floating point associativity, so we have to use the e versions.

I realize now that my problem was that I did not use the e versions for the reduction. Fixing that means we now get the correct answer:

julia> using VectorizationBase, SIMDPirates, KahanSummation, BenchmarkTools

julia> function sum_kahan(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
           px = pointer(x)
           W, shift = VectorizationBase.pick_vector_width_shift(T)
           sizeT = sizeof(T)
           WT = W * sizeT
           WT4 = 4WT
           N = length(x)
           Base.Cartesian.@nexprs 4 u -> begin
               s_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
               c_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
           end
           Nshift = N >> (shift + 2)
           for n in 0:Nshift-1
               Base.Cartesian.@nexprs 4 u -> begin
                   x_u = SIMDPirates.vload(Vec{W,T}, px + n*WT4 + (u-1)*WT)
                   t_u = SIMDPirates.evadd(s_u, x_u)
                   m_u = SIMDPirates.vless(SIMDPirates.vabs(x_u), SIMDPirates.vabs(s_u))
                   subfrom_u = SIMDPirates.vifelse(m_u, s_u, x_u)
                   addto_u = SIMDPirates.vifelse(m_u, x_u, s_u)
                   c_u = SIMDPirates.evsub(c_u, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_u, t_u), addto_u))
                   s_u = t_u
               end
           end
           
           rem = N & (4W-1)
           offset = Nshift * WT4
           for n in 0:(rem >> shift) - 1
               x_1 = SIMDPirates.vload(Vec{W,T}, px + offset)
               t_1 = SIMDPirates.evadd(s_1, x_1)
               m_1 = SIMDPirates.vless(SIMDPirates.vabs(x_1), SIMDPirates.vabs(s_1))
               subfrom_1 = SIMDPirates.vifelse(m_1, s_1, x_1)
               addto_1 = SIMDPirates.vifelse(m_1, x_1, s_1)
               c_1 = SIMDPirates.evsub(c_1, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_1, t_1), addto_1))
               s_1 = t_1
               offset += WT
           end
       
           rem &= (W-1)
           if rem > 0
               mask = unsafe_trunc(VectorizationBase.mask_type(T), 2^rem - 1)
               
               x_2 = SIMDPirates.vload(Vec{W,T}, px + offset, mask)
               t_2 = SIMDPirates.evadd(s_2, x_2)
               m_2 = SIMDPirates.vless(SIMDPirates.vabs(x_2), SIMDPirates.vabs(s_2))
               subfrom_2 = SIMDPirates.vifelse(m_2, s_2, x_2)
               addto_2 = SIMDPirates.vifelse(m_2, x_2, s_2)
               c_2 = SIMDPirates.evsub(c_2, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_2, t_2), addto_2))
               s_2 = t_2
           end
       
           c_1 = SIMDPirates.evadd(c_1, c_2)
           c_3 = SIMDPirates.evadd(c_3, c_4)
           c_1 = SIMDPirates.evadd(c_1, c_3)
       
           s_1 = SIMDPirates.evadd(s_1, s_2)
           s_3 = SIMDPirates.evadd(s_3, s_4)
           s_1 = SIMDPirates.evadd(s_1, s_3)
           SIMDPirates.vsum(SIMDPirates.evsub(s_1,c_1))
       end
sum_kahan (generic function with 1 method)

julia> x = 10^14 .* sin.((0.001:0.001:1) .* 2π);

julia> @btime Float64(sum(big, $x))
  89.799 μs (3998 allocations: 218.64 KiB)
1.1248384929460264

julia> @btime sum($x)
  41.968 ns (0 allocations: 0 bytes)
0.49223069682955295

julia> @btime sum_kbn($x)
  1.240 μs (0 allocations: 0 bytes)
1.1248384929460276

julia> @btime sum_kahan($x)
  146.561 ns (0 allocations: 0 bytes)
1.1248384929460264

Errors:

julia> relative_error(x, y) = (x - y) / y
relative_error (generic function with 1 method)

julia> relative_error(sum_kbn(x), Float64(sum(big, x)))
9.87006607248485e-16

julia> relative_error(sum_kahan(x), Float64(sum(big, x)))
0.0

julia> relative_error(sum(x), Float64(sum(big, x)))
-0.562398779988078
2 Likes

Thanks for the explanation. It’s nice to be able to select between efficiency and safety.

And thanks also for the corrected implementation.


Well, not really. If we look at the reduction stage, it is formally computing something like:

vsum\left( \sum_{i=1}^4 s_i - c_i\right)

That is a lot of additions, and there might still be catastrophic cancellations in any of them.

Here is your implementation running on my machine. The difference with your results might be explained by different architectures (and, therefore, SIMD widths): I’m using W=4. But I’m pretty sure that you would see this implementation behave badly on Per’s input vectors.

julia> check_fun(x, sum_kahan)
(1.1249173664289736, 7.011982915034423e-5)

So you were right in your initial post: the reduction was indeed the problem; not only because of the lack of explicit instructions, but also because cancellations occur in it.

The following, slightly modified version of your implementation seems to be much more accurate. Again, because I’m more familiar with it, I’m using Knuth’s two_sum to compensate for errors (à la Ogita, Rump & Oishi), but one could probably very well use Dekker’s fast_two_sum instead (which would be more in line with the rest of the KBN algorithm).

import SIMDPirates
function sum_kahan(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
    Vec = SIMDPirates.Vec

    px = pointer(x)
    W, shift = VectorizationBase.pick_vector_width_shift(T)
    sizeT = sizeof(T)
    WT = W * sizeT
    WT4 = 4WT
    N = length(x)
    Base.Cartesian.@nexprs 4 u -> begin
        s_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
        c_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
    end
    Nshift = N >> (shift + 2)
    for n in 0:Nshift-1
        Base.Cartesian.@nexprs 4 u -> begin
            x_u = SIMDPirates.vload(Vec{W,T}, px + n*WT4 + (u-1)*WT)
            t_u = SIMDPirates.evadd(s_u, x_u)
            m_u = SIMDPirates.vless(SIMDPirates.vabs(x_u), SIMDPirates.vabs(s_u))
            subfrom_u = SIMDPirates.vifelse(m_u, s_u, x_u)
            addto_u = SIMDPirates.vifelse(m_u, x_u, s_u)
            c_u = SIMDPirates.evsub(c_u, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_u, t_u), addto_u))
            s_u = t_u
        end
    end

    rem = N & (4W-1)
    offset = Nshift * WT4
    for n in 0:(rem >> shift) - 1
        x_1 = SIMDPirates.vload(Vec{W,T}, px + offset)
        t_1 = SIMDPirates.evadd(s_1, x_1)
        m_1 = SIMDPirates.vless(SIMDPirates.vabs(x_1), SIMDPirates.vabs(s_1))
        subfrom_1 = SIMDPirates.vifelse(m_1, s_1, x_1)
        addto_1 = SIMDPirates.vifelse(m_1, x_1, s_1)
        c_1 = SIMDPirates.evsub(c_1, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_1, t_1), addto_1))
        s_1 = t_1
        offset += WT
    end

    rem &= (W-1)
    if rem > 0
        mask = unsafe_trunc(VectorizationBase.mask_type(T), 2^rem - 1)

        x_2 = SIMDPirates.vload(Vec{W,T}, px + offset, mask)
        t_2 = SIMDPirates.evadd(s_2, x_2)
        m_2 = SIMDPirates.vless(SIMDPirates.vabs(x_2), SIMDPirates.vabs(s_2))
        subfrom_2 = SIMDPirates.vifelse(m_2, s_2, x_2)
        addto_2 = SIMDPirates.vifelse(m_2, x_2, s_2)
        c_2 = SIMDPirates.evsub(c_2, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_2, t_2), addto_2))
        s_2 = t_2
    end

    c_1 = SIMDPirates.evadd(c_1, c_2)
    c_3 = SIMDPirates.evadd(c_3, c_4)
    c_1 = SIMDPirates.evadd(c_1, c_3)

    # # Original version
    # s_1 = SIMDPirates.evadd(s_1, s_2)
    # s_3 = SIMDPirates.evadd(s_3, s_4)
    # s_1 = SIMDPirates.evadd(s_1, s_3)
    # return SIMDPirates.vsum(SIMDPirates.evsub(s_1, c_1))


    # Modified reduction
    s_1, e = two_sum(s_1, s_2); es = e
    s_3, e = two_sum(s_3, s_4); es = SIMDPirates.evadd(es, e)
    s_1, e = two_sum(s_1, s_3); es = SIMDPirates.evadd(es, e)

    s = zero(T)
    e = SIMDPirates.vsum(es) - SIMDPirates.vsum(c_1)
    for si in s_1
        s, ei = two_sum(s, si.value)
        e += ei
    end

    s + e
end

using SIMDPirates: evadd, evsub
function two_sum(a::T, b::T) where T <: NTuple
    x = evadd(a, b)
    z = evsub(x, a)
    e = evadd(evsub(a, evsub(x, z)),
              evsub(b, z))
    (x, e)
end
julia> x = 10^14 .* sin.((0.001:0.001:1) .* 2π);

julia> check_fun(x, sum_kahan)
(1.1248384929460264, 0.0)

And here is a performance comparison of all variants. On my machine, vectorized KBN seems to perform just slightly worse than vectorized ORO:

julia> x = 5000 |> N->randn(N) .* exp.(10 .* randn(N)) |> x->[x;-x;1.0] |> x->x[sortperm(rand(length(x)))];

julia> @btime sum($x)
  1.310 μs (0 allocations: 0 bytes)
-134.794921875

julia> @btime sum_oro_scal($x)
  19.697 μs (0 allocations: 0 bytes)
0.9999999999997371

julia> @btime sum_oro_vec($x)
  3.351 μs (0 allocations: 0 bytes)
0.9999999999999432

julia> @btime sum_kbn($x)
  20.416 μs (0 allocations: 0 bytes)
0.9999999999997371

julia> @btime sum_kahan($x)
  3.865 μs (0 allocations: 0 bytes)
0.9999999999999432
2 Likes

I just read through your earlier posts more thoroughly – great work!

If you’re referring to all the Base.Cartesian.@nexprs 4 u -> ..., this is because of instruction latency.
If you’d like some reading material, I found Performance Speed Limits interesting and informative. By unrolling 4 times, eg

julia> using MacroTools

julia> prettify(@macroexpand Base.Cartesian.@nexprs 4 u -> begin
                   xi = SIMD.vload(Vec{W,T}, x, offset)
                   s_u, ei = two_sum(s_u, xi)
                   e_u += ei
                   offset += W
               end)
quote
    xi = SIMD.vload(Vec{W, T}, x, offset)
    (s_1, ei) = two_sum(s_1, xi)
    e_1 += ei
    offset += W
    xi = SIMD.vload(Vec{W, T}, x, offset)
    (s_2, ei) = two_sum(s_2, xi)
    e_2 += ei
    offset += W
    xi = SIMD.vload(Vec{W, T}, x, offset)
    (s_3, ei) = two_sum(s_3, xi)
    e_3 += ei
    offset += W
    xi = SIMD.vload(Vec{W, T}, x, offset)
    (s_4, ei) = two_sum(s_4, xi)
    e_4 += ei
    offset += W
end

We are no longer bottle necked by instruction latency. Agner Fog’s instruction tables say that on Skylake(-X) the vectorized add, sub, mul, and fma instructions each have a latency of 4 clock cycles. These vary a little from one CPU family to another.
The Intel chips and 7 nm Ryzens should also be able to issue 2 instructions per clock cycle.

So that means to maximize performance, we need to have up to 8 instructions going at a time.

A simple example to see this, a rather naive sum function:

using VectorizationBase, SIMDPirates

@generated function usum(x::AbstractArray{T}, ::Val{Ushift} = Val(2)) where {T,Ushift}
    @assert 0 ≤ Ushift < 6
    U = 1 << Ushift
    W, shift = VectorizationBase.pick_vector_width_shift(T)
    totalshift = shift + Ushift
    WU = 1 << totalshift
    WT = W * sizeof(T)
    q = quote
        ptrx = pointer(x)
        Base.Cartesian.@nexprs $U u -> begin
            s_u = SIMDPirates.vbroadcast(Vec{$W,$T}, zero($T))
        end
        offset = 0
        N = length(x)
        for i in 1:(N >> $totalshift)
            Base.Cartesian.@nexprs $U u -> begin
                s_u = SIMDPirates.vadd(s_u, SIMDPirates.vload(Vec{$W,$T}, ptrx + offset))
                offset += $WT
            end
        end
        rem = N & $(WU - 1)
        for i in 1:(rem >> $shift)
            s_1 = SIMDPirates.vadd(s_1, SIMDPirates.vload(Vec{$W,$T}, ptrx + offset))
            offset += $WT
        end
        rem &= $(W-1)
        if rem > 0
            s_1 = SIMDPirates.vadd(s_1, SIMDPirates.vload(Vec{$W,$T}, ptrx + offset), VectorizationBase.mask(Val{$W}(), rem))
        end
    end
    Uh = U
    while Uh > 1
        Uh >>= 1
        push!(
            q.args,
            quote
            Base.Cartesian.@nexprs $Uh u -> begin
            s_u = SIMDPirates.vadd(s_u, s_{u + $Uh})
            end
            end
        )
    end
    push!(q.args, :(SIMDPirates.vsum(s_1)))
    q
end

I wanted the unroll factor to be a power of 2; so the function takes a Val argument Ushift, and we unroll by 2^Ushift = 1<<Ushift (on top of the vectorization unrolling).

We need large vectors to see benefit. Picking the largest vector that still fits in the L1 cache:

julia> first(VectorizationBase.CACHE_SIZE) ÷ sizeof(Float64)
4096

julia> x = rand(ans);

and running benchmarks:

julia> using BenchmarkTools

julia> @btime sum($x)
  296.868 ns (0 allocations: 0 bytes)
2036.5329867003175

julia> @btime usum($x, Val(0)) # unroll = 1 (no unrolling)
  471.189 ns (0 allocations: 0 bytes)
2036.5329867003172

julia> @btime usum($x, Val(1)) # unroll = 2^1 = 2
  249.814 ns (0 allocations: 0 bytes)
2036.532986700318

julia> @btime usum($x, Val(2)) # unroll = 2^2 = 4
  132.902 ns (0 allocations: 0 bytes)
2036.532986700318

julia> @btime usum($x, Val(3)) # unroll = 2^3 = 8
  84.432 ns (0 allocations: 0 bytes)
2036.532986700318

julia> @btime usum($x, Val(4)) # unroll = 2^4 = 16
  89.227 ns (0 allocations: 0 bytes)
2036.532986700318

julia> @btime usum($x, Val(5)) # unroll = 2^5 = 32
  108.952 ns (0 allocations: 0 bytes)
2036.532986700318

We can see that performance does indeed peak when we unroll by 8, which breaks up the dependency chain enough so that 8 vadd instructions are running at a time.

The difference is much less for smaller vectors. And if we’re summing less than 64 elements, unrolling by 8 means we’re actually not unrolling at all.

I actually realize now that I was too aggressive with the choice of 4. Our functions have a lot more going on than just repetitive adds, giving the CPU more to do without the need to break up dependency chains. Even just focusing on floating point arithmetic, your two_sum function can take advantage of order execution:

# D. E. Knuth, The Art of Computer Programming: Seminumerical Algorithms, 1969.
function two_sum(a::T, b::T) where T <: Union{Real, Vec}
    x = a + b
    z = x - a
    e = (a - (x-z)) + (b-z)
    (x, e)
end

x and z, and then x-z and b-z can each be calculated at the same time, for example.

As always, make hypothesis to try and understand what’s going on, and then test them. I modified your improvements to sum_kahan so that it will take arbitrary unrolls, like the simple sum function from earlier:

using VectorizationBase, SIMDPirates
using SIMDPirates: evadd, evsub
function two_sum(a::T, b::T) where T <: NTuple
    x = evadd(a, b)
    z = evsub(x, a)
    e = evadd(evsub(a, evsub(x, z)),
              evsub(b, z))
    (x, e)
end
function two_sum(a::T, b::T) where T <: Real
    x = a + b
    z = x - a
    e = (a - (x-z)) + (b-z)
    (x, e)
end

@generated function sum_kahan(x::AbstractArray{T}, ::Val{Ushift} = Val{1}()) where {T <: Union{Float32,Float64}, Ushift}
    @assert 0 ≤ Ushift < 6
    U = 1 << Ushift
    
    W, shift = VectorizationBase.pick_vector_width_shift(T)
    sizeT = sizeof(T)
    WT = W * sizeT
    WU = W*U
    WTU = WT*U
    V = SIMDPirates.Vec{W,T}

    q = quote
        px = pointer(x)
        N = length(x)
        Base.Cartesian.@nexprs $U u -> begin
            s_u = SIMDPirates.vbroadcast($V, zero($T))
            c_u = SIMDPirates.vbroadcast($V, zero($T))
        end
        Nshift = N >> $(shift + Ushift)
        offset = 0
        for n in 1:Nshift
            Base.Cartesian.@nexprs $U u -> begin
                x_u = SIMDPirates.vload($V, px + offset)
                t_u = SIMDPirates.evadd(s_u, x_u)
                m_u = SIMDPirates.vless(SIMDPirates.vabs(x_u), SIMDPirates.vabs(s_u))
                subfrom_u = SIMDPirates.vifelse(m_u, s_u, x_u)
                addto_u = SIMDPirates.vifelse(m_u, x_u, s_u)
                c_u = SIMDPirates.evsub(c_u, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_u, t_u), addto_u))
                s_u = t_u
                offset += $WT
            end
        end

        rem = N & $(WU-1)
        for n in 1:(rem >> $shift)
            x_1 = SIMDPirates.vload($V, px + offset)
            t_1 = SIMDPirates.evadd(s_1, x_1)
            m_1 = SIMDPirates.vless(SIMDPirates.vabs(x_1), SIMDPirates.vabs(s_1))
            subfrom_1 = SIMDPirates.vifelse(m_1, s_1, x_1)
            addto_1 = SIMDPirates.vifelse(m_1, x_1, s_1)
            c_1 = SIMDPirates.evsub(c_1, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_1, t_1), addto_1))
            s_1 = t_1
            offset += $WT
        end

        rem &= $(W-1)
        if rem > 0
            mask = VectorizationBase.mask(Val{$W}(), rem)

            x_2 = SIMDPirates.vload($V, px + offset, mask)
            t_2 = SIMDPirates.evadd(s_2, x_2)
            m_2 = SIMDPirates.vless(SIMDPirates.vabs(x_2), SIMDPirates.vabs(s_2))
            subfrom_2 = SIMDPirates.vifelse(m_2, s_2, x_2)
            addto_2 = SIMDPirates.vifelse(m_2, x_2, s_2)
            c_2 = SIMDPirates.evsub(c_2, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_2, t_2), addto_2))
            s_2 = t_2
        end
        es = SIMDPirates.vbroadcast($V, zero($T))
    end
    Uh = U
    while Uh > 1
        Uh >>= 1
        qtemp = quote
            Base.Cartesian.@nexprs $Uh u -> begin
                c_u = SIMDPirates.evadd(c_u, c_{u+$Uh})
                (s_u, e) = two_sum(s_u, s_{u+$Uh})
                es = SIMDPirates.evadd(es, e)
            end
        end
        push!(q.args, qtemp)
    end

    qscalarreduce = quote
        s = zero(T)
        e = SIMDPirates.vsum(es) - SIMDPirates.vsum(c_1)
        for si in s_1
            s, ei = two_sum(s, si.value)
            e += ei
        end

        s + e
    end
    push!(q.args, qscalarreduce)
    q
end

Now, we see that 2^1 = 2, and not 4 as I used, leads to the best performance:

julia> y = 10^14 .* sin.(LinRange(0, 2π, 4096));

julia> @btime sum($y)
  282.450 ns (0 allocations: 0 bytes)
16.0

julia> @btime sum(big, $y)
  375.410 μs (16382 allocations: 895.89 KiB)
-0.21548719867825956442164425652663339860737323760986328125

julia> @btime sum_kbn($y)
  5.098 μs (0 allocations: 0 bytes)
-0.21548719867826094

julia> @btime sum_kahan($y, Val(0))
  738.659 ns (0 allocations: 0 bytes)
-0.21548719867825916

julia> @btime sum_kahan($y, Val(1))
  607.716 ns (0 allocations: 0 bytes)
-0.21548719867825916

julia> @btime sum_kahan($y, Val(2))
  623.291 ns (0 allocations: 0 bytes)
-0.21548719867825916

julia> @btime sum_kahan($y, Val(3))
  631.864 ns (0 allocations: 0 bytes)
-0.2154871986782596

I’d guess that sum_oro_vec will also see slightly better performance with 2x instead of 4x.

6 Likes

I am enjoying the exploration that you (collectively) have embarked upon. For the benefit of those less familiar with all this, when you have found a good, somewhat transportable approach … do let us know rather than assuming all who are interested would get the message from reading each part of the development. Thanks.

10 Likes

Just a short note to let everyone interested know that @elrod and I have been working (even though not too much) since the last posts.

In short, here is a summary of where we are:

  1. qualitywise, the accuracy of the algorithms can be tested against vectors of constant size (200 elements in the results below), but of increasing condition number.

In the graph above, we see the relative error vary as a function of the condition number, in a log-log scale. Errors lower than \epsilon are arbitrarily set to \epsilon; conversely, when the relative error is more than 100% (i.e no digit is correctly computed anymore), the error is capped there in order to avoid affecting the scale of the graph too much. What we see is that the pairwise summation algorithm (as implemented in Base.sum) starts losing accuracy as soon as the condition number increases, computing only noise when the condition number exceeds \frac{1}{\epsilon} \simeq 10^{16}. In contrast, both compensated algorithms (Kahan-Babuska-Neumaier and Ogita-Rump-Oishi) still accurately compute the result at this point, and start losing accuracy there, computing meaningless results when the condition nuber reaches \frac{1}{\epsilon^2}\simeq 10^{32}. In effect these (simply) compensated algorithms produce the same results as if a naive summation had been performed with twice the working precision (128 bits) in this case, and then rounded to 64-bit floats.

This is conform to what the theory predicts, and validates the correctness of the implementations.

  1. performancewise, the implementations can be tested on a variety of vector sizes, and compared against the performance of the standard pairwise summation.

perf

In the graph above, the time spent in the summation (renormalized per element) is plotted against the vector size (the units in the y-axis label should be “ns/elem”). What we see with the standard summation is that, once vectors start having significant sizes (say, more than 1000 elements), the implementation is memory bound (as expected of a typical BLAS1 operation). Which is why we see significant decreases in the performance when the vector can’t fit into the L2 cache (around 30k elements, or 256kB on my machine) or the L3 cache (around 400k elements, or 3MB on y machine).

The Ogita-Rump-Oishi algorithm, when implemented with a suitable unrolling level (ushift=2, i.e 2^2=4 unrolled iterations), is CPU-bound when vectors fit inside the cache. However, when vectors are to large to fit into the L3 cache, the implementation becomes memory-bound again (on my system), which means we get the same performance as the standard summation.

In other words, the improved accuracy is free for sufficiently large vectors. For smaller vectors, the accuracy comes with a slow-down that can reach values slightly above 3 for vectors which fit in the L2 cache.

EDIT: one last detail: when benchmarking summation algorithms, I found the performance of Base.sum to vary wildly from machine to machine, even with the exact same Julia version. It looks like, on some systems, the linear summation (that occurs in the leafs of the pairwise summation tree, when blocks are sufficiently small) is vectorized. But on some other systems, it is not. And I could not determine any pattern to say what caused the vectorization to happen or to fail.

Is that a known issue?

11 Likes

In complement to the post above, I think we’ve now reached the point where the code to be shared is too complex to comfortably fit in a discourse thread. I would like to start implementing the CompensatedAlgorithms.jl package I mentioned a few days ago, and I could of course create it in github, either under my own account, or in one of the organizations to which I have access.

However, I feel like JuliaMath would be a perfect home for such a package, and I’m not provileged enough to contribute to packages there (let alone to create one). So I was wondering: should I create the CompensatedAlgorithms.jl package somewhere I have access to (and possibly move it later), or would it be more suitable to create it directly under the JuliaMath organization? In the latter case, how could this be done?

4 Likes

I think you can always ask JuliaMath to include your package at some point later on, it does not need to be created there.

Great work!

This is so great. :slight_smile:

That’s entirely possible. Newer chips have much more capable SIMD vectorization units. The newest instruction sets are AVX2 (most new laptop chips) and AVX512 (only on beefy Xeon server chips). You can check which instruction sets your chips implement in the “flags” fields of /proc/cpuinfo (on Linux) or sysctl -a | grep machdep.cpu.features (on macOS).

I have invited you to be a collaborator on https://github.com/JuliaMath/AccurateArithmetic.jl, which I had created to support DoubleFloats, a purpose for which it is no longer used. I cleaned out most or all of it. Let’s use this repo to provide implementations of compensated arithmetic algorithms.

1 Like

Thanks! The code used to generate the figures above is now provided in this repo.

I’m now trying to move forward and implement a dot product along the same lines. If everything goes according to the plan, it should be fairly easy to reuse most of the summation logic and share the vast majority of the code. We’ll see what happens in terms of performance, but I would be surprised if we still had the “free lunch” situation of the summation: since the dot product is a bit more arithmetically intensive, it might remain CPU-bound even outside the L3 cache.

Thanks. I also suspected something related to chip capabilities, but it appears to not be as simple as a avx2 vs avx512 thing. I’ve seen performance variations happen within each class of CPU.

Here is an experiment that is simple to perform, now that the explicitly vectorized code is easily available:

shell> grep "model name" /proc/cpuinfo | tail -n1
model name      : Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz

shell> grep "flags" /proc/cpuinfo | tail -n1                                                                            
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm con
stant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx1
6 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti retpoline rsb_ctxsw tpr_shadow v
nmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln 
pts hwp hwp_notify hwp_act_window hwp_epp                                                                                                                                

shell> git clone https://github.com/JuliaMath/AccurateArithmetic.jl.git
shell> cd AccurateArithmetic.jl/
shell> julia --project

julia> using Pkg; Pkg.instantiate()
julia> using AccurateArithmetic
julia> using BenchmarkTools

julia> x = rand(10_000);

julia> @btime sum($x)
  1.342 μs (0 allocations: 0 bytes)
4986.434017383861

julia> @btime sum_oro($x)
  3.410 μs (0 allocations: 0 bytes)
4986.434017383861

On this machine (a core i5), sum is 2-3x faster than sum_oro. On very similar machines, I’ve seen the opposite happen. I’d be grateful if some discourse users could perform the same kind of test and publish their results; maybe that would help understand what happens and in which conditions.

But maybe this would be best posted in a separate discourse thread?

I’m not sure this is what you had in mind, but @elrod and I co-authored a paper that we’re submitting to the Correctness 2019 workshop (to be held in conjunction with SC’19). A preliminary version of the preprint is available here for anyone interested:

(the deadline for submissions has been extended to August 19th, so we still have some time to take your comments / questions into account if you have some)

9 Likes

That is precisely what I had hoped you would share. :smile:

1 Like

Commenting on this tangent:

It’s many orderS of magnitude slower than sum (I didn’t try xsum), without sort: AND gets less precise answers as it seems to disable default sum algorithm?!

[Better would be to sort by absolute magnitude, but I get away with this here. And I’m just showing for timing, and then saw sorting gave same answer so not needed, at least for my array a.]

julia> @time sum(sort(a))  # is only 27 times slower (for that many numbers, didn't try radixsort
  0.000804 seconds (7 allocations: 78.375 KiB)
0.0

julia> @time sum(big.(a))  # sum for non-big must use a different algorithm, not a straight sum over array, as it gave 0.0? Error can be much larger with big:
  0.043928 seconds (20.01 k allocations: 1.145 MiB)
-15.744873851312446018712308593471442723857083704853153190993195048996768170998

It shouldn’t be worse as here (meaning without sort), when a higher precision datatype is used. While big alone wouldn’t help much for precision (not speed), it seems it should be a bit better, all else equal, if that where the case…