Julia equivalent of Python's "fsum" for floating point summation

This discussion brings me to a question: would there be some space and interest in the Julia ecosystem for a CompensatedAlgorithms.jl package?

It could replace and extend KahanSummation.jl by providing several algorithms (not just Kahan-Babuska-Neumaier, but also Ogita-Rump-Oishi, Priest, and others). Nor would such a package need to be limited to summation and cumulative summation: it could also provide accurate dot products or polynomial evaluation, for example.

I think we already have all the necessary building blocks (e.g. EFTs, error-free transformations, in ErrorfreeArithmetic.jl or StochasticArithmetic.jl; I haven’t yet taken the time to properly compare the two implementations performance-wise), and we’d only need to list and implement the algorithms themselves.
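To make the building blocks concrete, here is a rough scalar sketch (mine, not code from any of the packages mentioned): two_sum is Knuth's EFT for addition, which reappears later in this thread; two_prod is the multiplication EFT, assuming a hardware fma; and dot2 is the Ogita-Rump-Oishi compensated dot product built on top of them.

# Error-free transformations (EFTs): the rounded result plus its exact error.

# Knuth's two_sum: a + b == x + e, exactly
function two_sum(a::Float64, b::Float64)
    x = a + b
    z = x - a
    e = (a - (x - z)) + (b - z)
    (x, e)
end

# two_prod via fma: a * b == x + e, exactly
function two_prod(a::Float64, b::Float64)
    x = a * b
    e = fma(a, b, -x)
    (x, e)
end

# Compensated dot product (Ogita-Rump-Oishi "dot2"), built from the EFTs
function dot2(x::AbstractVector{Float64}, y::AbstractVector{Float64})
    s = e = 0.0
    @inbounds for i in eachindex(x, y)
        p, ep = two_prod(x[i], y[i])
        s, es = two_sum(s, p)
        e += ep + es          # accumulate both rounding errors
    end
    s + e
end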

If there is some enthusiasm for such a thing, I’m very much willing to take part in its development.


I don’t think your problem comes from the final reduction, but rather from the vectorized summation itself.

After running many tests (most of which made no sense whatsoever), I have come to a strong suspicion that LLVM breaks some of the compensation during its optimization passes.

I still need to document this, but you need the versions with an e (for “explicitly”) in front: evadd, evsub, evmul, etc., instead of vadd, vsub, and vmul.

using SIMDPirates, VectorizationBase

# Fused version: LLVM is free to contract this into a single FMA instruction
fused_nmul_sub(a, b, c) = SIMDPirates.vsub(SIMDPirates.vmul(SIMDPirates.vsub(a), b), c)

# "Explicit" version: each floating-point operation is emitted separately
unfused_nmul_sub(a, b, c) = SIMDPirates.evsub(SIMDPirates.evmul(SIMDPirates.vsub(a), b), c)

# Random SIMD vector of the natural width for T
randvec(::Type{T} = Float64) where {T} =
    ntuple(w -> Core.VecElement(randn(T)), Val(VectorizationBase.pick_vector_width(T)))

a = randvec(); b = randvec(); c = randvec();

This yields

julia> @code_native fused_nmul_sub(a, b, c)
	.text
; ┌ @ REPL[38]:1 within `fused_nmul_sub'
; │┌ @ SIMDPirates.jl:35 within `vsub'
; ││┌ @ floating_point_arithmetic.jl:58 within `macro expansion'
; │││┌ @ llvmwrap.jl:66 within `llvmwrap' @ llvmwrap.jl:66
; ││││┌ @ REPL[38]:1 within `macro expansion'
	vfnmsub213pd	%zmm2, %zmm1, %zmm0 # zmm0 = -(zmm1 * zmm0) - zmm2
; │└└└└
	retq
	nopw	(%rax,%rax)
; └

julia> @code_native unfused_nmul_sub(a, b, c)
	.text
; ┌ @ REPL[39]:1 within `unfused_nmul_sub'
; │┌ @ SIMDPirates.jl:49 within `evmul'
; ││┌ @ floating_point_arithmetic.jl:65 within `macro expansion'
; │││┌ @ llvmwrap.jl:96 within `llvmwrap_notfast' @ llvmwrap.jl:96
; ││││┌ @ REPL[39]:1 within `macro expansion'
	vmulpd	%zmm1, %zmm0, %zmm0
	movabsq	$140629991519192, %rax  # imm = 0x7FE6F8B073D8
	vxorpd	(%rax){1to8}, %zmm0, %zmm0
; │└└└└
; │┌ @ SIMDPirates.jl:49 within `evsub'
; ││┌ @ floating_point_arithmetic.jl:65 within `macro expansion'
; │││┌ @ llvmwrap.jl:96 within `llvmwrap_notfast' @ llvmwrap.jl:96
; ││││┌ @ llvmwrap.jl:115 within `macro expansion'
	vsubpd	%zmm2, %zmm0, %zmm0
; │└└└└
	retq
	nopl	(%rax)
; └

That is 1 instruction vs. 4. Of course, here we do not want the automatic fusion or floating-point reassociation, so we have to use the e versions.
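To see why this matters for compensated algorithms in particular: the error term of two_sum (quoted further down in this thread) is algebraically zero, so an optimizer that is allowed to reassociate floating-point operations may simply delete it. Here is a minimal sketch of that failure mode, using @fastmath as a stand-in for the reassociating optimizations (naive_err is a hypothetical name):

# two_sum's error term is zero in real arithmetic, so under fast-math
# semantics the compiler is entitled to fold it away entirely.
function naive_err(a, b)
    @fastmath begin
        x = a + b
        z = x - a
        (a - (x - z)) + (b - z)   # the rounding error of a + b
    end
end

naive_err(1.0e16, 1.0)   # true error is 1.0; under fast-math this may return 0.0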

I realize now that my problem was that I did not use the e versions for the reduction. Fixing that means we now get the correct answer:

julia> using VectorizationBase, SIMDPirates, KahanSummation, BenchmarkTools

julia> function sum_kahan(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
           px = pointer(x)
           W, shift = VectorizationBase.pick_vector_width_shift(T)
           sizeT = sizeof(T)
           WT = W * sizeT
           WT4 = 4WT
           N = length(x)
           Base.Cartesian.@nexprs 4 u -> begin
               s_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
               c_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
           end
           Nshift = N >> (shift + 2)
           for n in 0:Nshift-1
               Base.Cartesian.@nexprs 4 u -> begin
                   x_u = SIMDPirates.vload(Vec{W,T}, px + n*WT4 + (u-1)*WT)
                   t_u = SIMDPirates.evadd(s_u, x_u)
                   m_u = SIMDPirates.vless(SIMDPirates.vabs(x_u), SIMDPirates.vabs(s_u))
                   subfrom_u = SIMDPirates.vifelse(m_u, s_u, x_u)
                   addto_u = SIMDPirates.vifelse(m_u, x_u, s_u)
                   c_u = SIMDPirates.evsub(c_u, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_u, t_u), addto_u))
                   s_u = t_u
               end
           end
           
           rem = N & (4W-1)
           offset = Nshift * WT4
           for n in 0:(rem >> shift) - 1
               x_1 = SIMDPirates.vload(Vec{W,T}, px + offset)
               t_1 = SIMDPirates.evadd(s_1, x_1)
               m_1 = SIMDPirates.vless(SIMDPirates.vabs(x_1), SIMDPirates.vabs(s_1))
               subfrom_1 = SIMDPirates.vifelse(m_1, s_1, x_1)
               addto_1 = SIMDPirates.vifelse(m_1, x_1, s_1)
               c_1 = SIMDPirates.evsub(c_1, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_1, t_1), addto_1))
               s_1 = t_1
               offset += WT
           end
       
           rem &= (W-1)
           if rem > 0
               mask = unsafe_trunc(VectorizationBase.mask_type(T), 2^rem - 1)
               
               x_2 = SIMDPirates.vload(Vec{W,T}, px + offset, mask)
               t_2 = SIMDPirates.evadd(s_2, x_2)
               m_2 = SIMDPirates.vless(SIMDPirates.vabs(x_2), SIMDPirates.vabs(s_2))
               subfrom_2 = SIMDPirates.vifelse(m_2, s_2, x_2)
               addto_2 = SIMDPirates.vifelse(m_2, x_2, s_2)
               c_2 = SIMDPirates.evsub(c_2, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_2, t_2), addto_2))
               s_2 = t_2
           end
       
           c_1 = SIMDPirates.evadd(c_1, c_2)
           c_3 = SIMDPirates.evadd(c_3, c_4)
           c_1 = SIMDPirates.evadd(c_1, c_3)
       
           s_1 = SIMDPirates.evadd(s_1, s_2)
           s_3 = SIMDPirates.evadd(s_3, s_4)
           s_1 = SIMDPirates.evadd(s_1, s_3)
           SIMDPirates.vsum(SIMDPirates.evsub(s_1,c_1))
       end
sum_kahan (generic function with 1 method)

julia> x = 10^14 .* sin.((0.001:0.001:1) .* 2π);

julia> @btime Float64(sum(big, $x))
  89.799 μs (3998 allocations: 218.64 KiB)
1.1248384929460264

julia> @btime sum($x)
  41.968 ns (0 allocations: 0 bytes)
0.49223069682955295

julia> @btime sum_kbn($x)
  1.240 μs (0 allocations: 0 bytes)
1.1248384929460276

julia> @btime sum_kahan($x)
  146.561 ns (0 allocations: 0 bytes)
1.1248384929460264

Errors:

julia> relative_error(x, y) = (x - y) / y
relative_error (generic function with 1 method)

julia> relative_error(sum_kbn(x), Float64(sum(big, x)))
9.87006607248485e-16

julia> relative_error(sum_kahan(x), Float64(sum(big, x)))
0.0

julia> relative_error(sum(x), Float64(sum(big, x)))
-0.562398779988078

Thanks for the explanation. It’s nice to be able to select between efficiency and safety.

And thanks also for the corrected implementation.


Well, not really. If we look at the reduction stage, it is formally computing something like:

$$\mathrm{vsum}\left(\sum_{i=1}^4 (s_i - c_i)\right)$$

That is a lot of additions, and there might still be catastrophic cancellations in any of them.
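As a contrived scalar illustration (made-up numbers, with scalars standing in for the SIMD accumulators), the plain reduction can first round information away and then expose the loss through cancellation:

s_1, c_1 = 1.0e16,  0.0   # exact partial sum:  1.0e16
s_2, c_2 = 1.0,     0.0   # exact partial sum:  1.0
s_3, c_3 = -1.0e16, 0.0   # exact partial sum: -1.0e16
s_4, c_4 = 0.0,     0.0

s_12 = s_1 + s_2    # rounds to 1.0e16: the 1.0 is already lost
s_34 = s_3 + s_4    # -1.0e16
(s_12 + s_34) - (c_1 + c_2 + c_3 + c_4)   # 0.0, but the true sum is 1.0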

Here is your implementation running on my machine. The difference from your results might be explained by our different architectures (and, therefore, SIMD widths): I’m using W=4. But I’m pretty sure you would see this implementation behave badly on Per’s input vectors.

julia> check_fun(x, sum_kahan)
(1.1249173664289736, 7.011982915034423e-5)

So you were right in your initial post: the reduction was indeed the problem; not only because of the lack of explicit instructions, but also because cancellations occur in it.

The following, slightly modified version of your implementation seems to be much more accurate. Again, because I’m more familiar with it, I’m using Knuth’s two_sum to compensate for errors (à la Ogita, Rump & Oishi), but one could probably just as well use Dekker’s fast_two_sum instead (which would be more in line with the rest of the KBN algorithm).
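For reference, here is a minimal scalar sketch of Dekker’s fast_two_sum (it is not used in the code below); the branch enforces the precondition |a| ≥ |b| under which the algorithm is exact:

# Dekker's fast_two_sum: 3 flops plus a branch, vs 6 flops for two_sum.
# Exact only when |a| >= |b|, hence the swap.
function fast_two_sum(a::T, b::T) where T <: Real
    if abs(a) < abs(b)
        a, b = b, a
    end
    x = a + b         # rounded sum
    e = b - (x - a)   # exact error: a + b == x + e
    (x, e)
end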

import SIMDPirates, VectorizationBase
function sum_kahan(x::AbstractArray{T}) where {T <: Union{Float32,Float64}}
    Vec = SIMDPirates.Vec

    px = pointer(x)
    W, shift = VectorizationBase.pick_vector_width_shift(T)
    sizeT = sizeof(T)
    WT = W * sizeT
    WT4 = 4WT
    N = length(x)
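    # Four independent (sum, compensation) accumulator pairs, to hide instruction latency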
    Base.Cartesian.@nexprs 4 u -> begin
        s_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
        c_u = SIMDPirates.vbroadcast(Vec{W,T}, zero(T))
    end
    Nshift = N >> (shift + 2)
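    # Main loop: four interleaved Kahan-Babuska-Neumaier (KBN) updates per iteration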
    for n in 0:Nshift-1
        Base.Cartesian.@nexprs 4 u -> begin
            x_u = SIMDPirates.vload(Vec{W,T}, px + n*WT4 + (u-1)*WT)
            t_u = SIMDPirates.evadd(s_u, x_u)
            m_u = SIMDPirates.vless(SIMDPirates.vabs(x_u), SIMDPirates.vabs(s_u))
            subfrom_u = SIMDPirates.vifelse(m_u, s_u, x_u)
            addto_u = SIMDPirates.vifelse(m_u, x_u, s_u)
            c_u = SIMDPirates.evsub(c_u, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_u, t_u), addto_u))
            s_u = t_u
        end
    end

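    # Remainder: leftover whole vectors, processed one at a time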
    rem = N & (4W-1)
    offset = Nshift * WT4
    for n in 0:(rem >> shift) - 1
        x_1 = SIMDPirates.vload(Vec{W,T}, px + offset)
        t_1 = SIMDPirates.evadd(s_1, x_1)
        m_1 = SIMDPirates.vless(SIMDPirates.vabs(x_1), SIMDPirates.vabs(s_1))
        subfrom_1 = SIMDPirates.vifelse(m_1, s_1, x_1)
        addto_1 = SIMDPirates.vifelse(m_1, x_1, s_1)
        c_1 = SIMDPirates.evsub(c_1, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_1, t_1), addto_1))
        s_1 = t_1
        offset += WT
    end

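    # Fewer than W elements left: masked load for the final partial vector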
    rem &= (W-1)
    if rem > 0
        mask = unsafe_trunc(VectorizationBase.mask_type(T), 2^rem - 1)

        x_2 = SIMDPirates.vload(Vec{W,T}, px + offset, mask)
        t_2 = SIMDPirates.evadd(s_2, x_2)
        m_2 = SIMDPirates.vless(SIMDPirates.vabs(x_2), SIMDPirates.vabs(s_2))
        subfrom_2 = SIMDPirates.vifelse(m_2, s_2, x_2)
        addto_2 = SIMDPirates.vifelse(m_2, x_2, s_2)
        c_2 = SIMDPirates.evsub(c_2, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_2, t_2), addto_2))
        s_2 = t_2
    end

    c_1 = SIMDPirates.evadd(c_1, c_2)
    c_3 = SIMDPirates.evadd(c_3, c_4)
    c_1 = SIMDPirates.evadd(c_1, c_3)

    # # Original version
    # s_1 = SIMDPirates.evadd(s_1, s_2)
    # s_3 = SIMDPirates.evadd(s_3, s_4)
    # s_1 = SIMDPirates.evadd(s_1, s_3)
    # return SIMDPirates.vsum(SIMDPirates.evsub(s_1, c_1))


    # Modified reduction: compensate the s_u additions themselves with two_sum
    # (defined below), accumulating the extra rounding errors in es
    s_1, e = two_sum(s_1, s_2); es = e
    s_3, e = two_sum(s_3, s_4); es = SIMDPirates.evadd(es, e)
    s_1, e = two_sum(s_1, s_3); es = SIMDPirates.evadd(es, e)

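    # Horizontal reduction: add up the lanes of s_1 with scalar two_sum,
    # folding all accumulated compensation into e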
    s = zero(T)
    e = SIMDPirates.vsum(es) - SIMDPirates.vsum(c_1)
    for si in s_1
        s, ei = two_sum(s, si.value)
        e += ei
    end

    s + e
end

using SIMDPirates: evadd, evsub

# Knuth's two_sum acting lanewise on SIMD vectors; the explicit evadd/evsub
# keep LLVM from folding the (algebraically zero) error term away
function two_sum(a::T, b::T) where T <: NTuple
    x = evadd(a, b)
    z = evsub(x, a)
    e = evadd(evsub(a, evsub(x, z)),
              evsub(b, z))
    (x, e)
end

# Scalar fallback, used in the final lanewise reduction above
function two_sum(a::T, b::T) where T <: Real
    x = a + b
    z = x - a
    e = (a - (x - z)) + (b - z)
    (x, e)
end

julia> x = 10^14 .* sin.((0.001:0.001:1) .* 2π);

julia> check_fun(x, sum_kahan)
(1.1248384929460264, 0.0)

And here is a performance comparison of all variants. On my machine, vectorized KBN seems to perform just slightly worse than vectorized ORO:

julia> x = 5000 |> N->randn(N) .* exp.(10 .* randn(N)) |> x->[x;-x;1.0] |> x->x[sortperm(rand(length(x)))];

julia> @btime sum($x)
  1.310 μs (0 allocations: 0 bytes)
-134.794921875

julia> @btime sum_oro_scal($x)
  19.697 μs (0 allocations: 0 bytes)
0.9999999999997371

julia> @btime sum_oro_vec($x)
  3.351 μs (0 allocations: 0 bytes)
0.9999999999999432

julia> @btime sum_kbn($x)
  20.416 μs (0 allocations: 0 bytes)
0.9999999999997371

julia> @btime sum_kahan($x)
  3.865 μs (0 allocations: 0 bytes)
0.9999999999999432

I just read through your earlier posts more thoroughly – great work!

If you’re referring to all the Base.Cartesian.@nexprs 4 u -> ... blocks, this is because of instruction latency.
If you’d like some reading material, I found Performance Speed Limits interesting and informative. Consider unrolling 4 times, e.g.:

julia> using MacroTools

julia> prettify(@macroexpand Base.Cartesian.@nexprs 4 u -> begin
                   xi = SIMD.vload(Vec{W,T}, x, offset)
                   s_u, ei = two_sum(s_u, xi)
                   e_u += ei
                   offset += W
               end)
quote
    xi = SIMD.vload(Vec{W, T}, x, offset)
    (s_1, ei) = two_sum(s_1, xi)
    e_1 += ei
    offset += W
    xi = SIMD.vload(Vec{W, T}, x, offset)
    (s_2, ei) = two_sum(s_2, xi)
    e_2 += ei
    offset += W
    xi = SIMD.vload(Vec{W, T}, x, offset)
    (s_3, ei) = two_sum(s_3, xi)
    e_3 += ei
    offset += W
    xi = SIMD.vload(Vec{W, T}, x, offset)
    (s_4, ei) = two_sum(s_4, xi)
    e_4 += ei
    offset += W
end

With this, we are no longer bottlenecked by instruction latency. Agner Fog’s instruction tables say that on Skylake(-X) the vectorized add, sub, mul, and fma instructions each have a latency of 4 clock cycles (these numbers vary a little from one CPU family to another). The Intel chips and 7 nm Ryzens should also be able to issue 2 such instructions per clock cycle.

So, to maximize performance, we need up to 4 × 2 = 8 independent instructions in flight at a time.

As a simple example to see this, here is a rather naive sum function:

using VectorizationBase, SIMDPirates

@generated function usum(x::AbstractArray{T}, ::Val{Ushift} = Val(2)) where {T,Ushift}
    @assert 0 ≤ Ushift < 6
    U = 1 << Ushift
    W, shift = VectorizationBase.pick_vector_width_shift(T)
    totalshift = shift + Ushift
    WU = 1 << totalshift
    WT = W * sizeof(T)
    q = quote
        ptrx = pointer(x)
        Base.Cartesian.@nexprs $U u -> begin
            s_u = SIMDPirates.vbroadcast(Vec{$W,$T}, zero($T))
        end
        offset = 0
        N = length(x)
        for i in 1:(N >> $totalshift)
            Base.Cartesian.@nexprs $U u -> begin
                s_u = SIMDPirates.vadd(s_u, SIMDPirates.vload(Vec{$W,$T}, ptrx + offset))
                offset += $WT
            end
        end
        rem = N & $(WU - 1)
        for i in 1:(rem >> $shift)
            s_1 = SIMDPirates.vadd(s_1, SIMDPirates.vload(Vec{$W,$T}, ptrx + offset))
            offset += $WT
        end
        rem &= $(W-1)
        if rem > 0
            s_1 = SIMDPirates.vadd(s_1, SIMDPirates.vload(Vec{$W,$T}, ptrx + offset), VectorizationBase.mask(Val{$W}(), rem))
        end
    end
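    # Pairwise tree reduction: halve the number of accumulators each pass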
    Uh = U
    while Uh > 1
        Uh >>= 1
        push!(
            q.args,
            quote
            Base.Cartesian.@nexprs $Uh u -> begin
            s_u = SIMDPirates.vadd(s_u, s_{u + $Uh})
            end
            end
        )
    end
    push!(q.args, :(SIMDPirates.vsum(s_1)))
    q
end

I wanted the unroll factor to be a power of 2, so the function takes a Val argument Ushift and unrolls by 2^Ushift = 1 << Ushift (on top of the vectorization unrolling).

We need long vectors to see the benefit. Picking the largest vector that still fits in the L1 cache:

julia> first(VectorizationBase.CACHE_SIZE) ÷ sizeof(Float64)
4096

julia> x = rand(ans);

and running benchmarks:

julia> using BenchmarkTools

julia> @btime sum($x)
  296.868 ns (0 allocations: 0 bytes)
2036.5329867003175

julia> @btime usum($x, Val(0)) # unroll = 1 (no unrolling)
  471.189 ns (0 allocations: 0 bytes)
2036.5329867003172

julia> @btime usum($x, Val(1)) # unroll = 2^1 = 2
  249.814 ns (0 allocations: 0 bytes)
2036.532986700318

julia> @btime usum($x, Val(2)) # unroll = 2^2 = 4
  132.902 ns (0 allocations: 0 bytes)
2036.532986700318

julia> @btime usum($x, Val(3)) # unroll = 2^3 = 8
  84.432 ns (0 allocations: 0 bytes)
2036.532986700318

julia> @btime usum($x, Val(4)) # unroll = 2^4 = 16
  89.227 ns (0 allocations: 0 bytes)
2036.532986700318

julia> @btime usum($x, Val(5)) # unroll = 2^5 = 32
  108.952 ns (0 allocations: 0 bytes)
2036.532986700318

We can see that performance does indeed peak when we unroll by 8, which breaks up the dependency chain enough so that 8 vadd instructions are running at a time.

The difference is much smaller for shorter vectors. And if we’re summing fewer than 64 elements (with W = 8, an 8-fold unroll consumes 64 elements per iteration), the unrolled main loop never runs at all.

I actually realize now that I was too aggressive with the choice of 4. Our functions have a lot more going on than just repeated adds, giving the CPU more to do without the need to break up dependency chains. Even just focusing on the floating-point arithmetic, your two_sum function can take advantage of out-of-order execution:

# D. E. Knuth, The Art of Computer Programming: Seminumerical Algorithms, 1969.
function two_sum(a::T, b::T) where T <: Union{Real, Vec}
    x = a + b
    z = x - a
    e = (a - (x-z)) + (b-z)
    (x, e)
end

For example, once z is available, x-z and b-z can be computed at the same time.

As always, make hypotheses to try to understand what’s going on, and then test them. I modified your improved sum_kahan so that it takes arbitrary unroll factors, like the simple sum function from earlier:

using VectorizationBase, SIMDPirates
using SIMDPirates: evadd, evsub
function two_sum(a::T, b::T) where T <: NTuple
    x = evadd(a, b)
    z = evsub(x, a)
    e = evadd(evsub(a, evsub(x, z)),
              evsub(b, z))
    (x, e)
end
function two_sum(a::T, b::T) where T <: Real
    x = a + b
    z = x - a
    e = (a - (x-z)) + (b-z)
    (x, e)
end

@generated function sum_kahan(x::AbstractArray{T}, ::Val{Ushift} = Val{1}()) where {T <: Union{Float32,Float64}, Ushift}
    @assert 0 ≤ Ushift < 6
    U = 1 << Ushift
    
    W, shift = VectorizationBase.pick_vector_width_shift(T)
    sizeT = sizeof(T)
    WT = W * sizeT
    WU = W*U
    WTU = WT*U
    V = SIMDPirates.Vec{W,T}
    # With no unrolling (U == 1) only s_1/c_1 exist, so the masked tail
    # below must fold into accumulator 1 rather than accumulator 2:
    stail = Symbol(:s_, min(2, U))
    ctail = Symbol(:c_, min(2, U))

    q = quote
        px = pointer(x)
        N = length(x)
        Base.Cartesian.@nexprs $U u -> begin
            s_u = SIMDPirates.vbroadcast($V, zero($T))
            c_u = SIMDPirates.vbroadcast($V, zero($T))
        end
        Nshift = N >> $(shift + Ushift)
        offset = 0
        for n in 1:Nshift
            Base.Cartesian.@nexprs $U u -> begin
                x_u = SIMDPirates.vload($V, px + offset)
                t_u = SIMDPirates.evadd(s_u, x_u)
                m_u = SIMDPirates.vless(SIMDPirates.vabs(x_u), SIMDPirates.vabs(s_u))
                subfrom_u = SIMDPirates.vifelse(m_u, s_u, x_u)
                addto_u = SIMDPirates.vifelse(m_u, x_u, s_u)
                c_u = SIMDPirates.evsub(c_u, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_u, t_u), addto_u))
                s_u = t_u
                offset += $WT
            end
        end

        rem = N & $(WU-1)
        for n in 1:(rem >> $shift)
            x_1 = SIMDPirates.vload($V, px + offset)
            t_1 = SIMDPirates.evadd(s_1, x_1)
            m_1 = SIMDPirates.vless(SIMDPirates.vabs(x_1), SIMDPirates.vabs(s_1))
            subfrom_1 = SIMDPirates.vifelse(m_1, s_1, x_1)
            addto_1 = SIMDPirates.vifelse(m_1, x_1, s_1)
            c_1 = SIMDPirates.evsub(c_1, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_1, t_1), addto_1))
            s_1 = t_1
            offset += $WT
        end

        rem &= $(W-1)
        if rem > 0
            mask = VectorizationBase.mask(Val{$W}(), rem)

            x_2 = SIMDPirates.vload($V, px + offset, mask)
            t_2 = SIMDPirates.evadd($stail, x_2)
            m_2 = SIMDPirates.vless(SIMDPirates.vabs(x_2), SIMDPirates.vabs($stail))
            subfrom_2 = SIMDPirates.vifelse(m_2, $stail, x_2)
            addto_2 = SIMDPirates.vifelse(m_2, x_2, $stail)
            $ctail = SIMDPirates.evsub($ctail, SIMDPirates.evadd(SIMDPirates.evsub(subfrom_2, t_2), addto_2))
            $stail = t_2
        end
        es = SIMDPirates.vbroadcast($V, zero($T))
    end
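    # Build the reduction tree: halve the accumulators each pass, compensating
    # the s_u additions with two_sum and collecting the new errors in es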
    Uh = U
    while Uh > 1
        Uh >>= 1
        qtemp = quote
            Base.Cartesian.@nexprs $Uh u -> begin
                c_u = SIMDPirates.evadd(c_u, c_{u+$Uh})
                (s_u, e) = two_sum(s_u, s_{u+$Uh})
                es = SIMDPirates.evadd(es, e)
            end
        end
        push!(q.args, qtemp)
    end

    qscalarreduce = quote
        s = zero(T)
        e = SIMDPirates.vsum(es) - SIMDPirates.vsum(c_1)
        for si in s_1
            s, ei = two_sum(s, si.value)
            e += ei
        end

        s + e
    end
    push!(q.args, qscalarreduce)
    q
end

Now we see that an unroll factor of 2^1 = 2, not the 4 I used earlier, gives the best performance:

julia> y = 10^14 .* sin.(LinRange(0, 2π, 4096));

julia> @btime sum($y)
  282.450 ns (0 allocations: 0 bytes)
16.0

julia> @btime sum(big, $y)
  375.410 μs (16382 allocations: 895.89 KiB)
-0.21548719867825956442164425652663339860737323760986328125

julia> @btime sum_kbn($y)
  5.098 μs (0 allocations: 0 bytes)
-0.21548719867826094

julia> @btime sum_kahan($y, Val(0))
  738.659 ns (0 allocations: 0 bytes)
-0.21548719867825916

julia> @btime sum_kahan($y, Val(1))
  607.716 ns (0 allocations: 0 bytes)
-0.21548719867825916

julia> @btime sum_kahan($y, Val(2))
  623.291 ns (0 allocations: 0 bytes)
-0.21548719867825916

julia> @btime sum_kahan($y, Val(3))
  631.864 ns (0 allocations: 0 bytes)
-0.2154871986782596

I’d guess that sum_oro_vec will also see slightly better performance with 2x unrolling instead of 4x.


I am enjoying the exploration that you (collectively) have embarked upon. For the benefit of those less familiar with all this: when you have found a good, somewhat transportable approach … do let us know, rather than assuming that all who are interested will get the message from reading each part of the development. Thanks.
