Sum an SVector skipping one element

I am optimizing a tight loop where one operation is summing an SVector except that it is required that an element is skipped. Its index is not known at compile time, it is obtained from a lookup table.

Some attempts:

using StaticArrays, BenchmarkTools

# ver1: cheating because float +/- is not associative, but easy to code
sum_but1(x::SVector, o) = sum(x) - x[o]

# ver2: try to zero it out 
function sum_but2(x::SVector{N,T}, o) where {N,T}
    xm = MVector(x)
    xm[o] = zero(T)
    sum(SVector(xm))
end

# ver3: skip in a sophisticated way
function sum_but3(x::SVector{N,T}, o) where {N,T}
    _, s = foldl(x; init = (1, zero(T))) do (j, s), x
        j + 1, (j == o ? s : s + x)::T
    end
    s
end

# benchmark with multiple o's, to represent the context better
function f(sum_f::F, x, os) where F
    sum(o -> sum_f(x, o), os)
end

Benchmarks:

julia> x = randn(SVector{5})
5-element SVector{5, Float64} with indices SOneTo(5):
 -0.8928591641259742
 -1.0148090290182
 -0.2777069921829845
 -0.49826579522478914
  0.9804687828089862

julia> os = rand(1:5, 100);

julia> @benchmark f($sum_but1, $x, $os)
BenchmarkTools.Trial: 10000 samples with 966 evaluations.
 Range (min … max):  79.156 ns … 122.954 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     80.699 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   81.405 ns Β±   2.188 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–‚β–ˆβ–…β–        β–„β–ˆ                                           
  β–…β–„β–‚β–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–„β–β–β–β–β–ƒβ–ƒβ–β–ˆβ–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  79.2 ns         Histogram: frequency by time           91 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($sum_but2, $x, $os)
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
 Range (min … max):  553.343 ns … 23.668 ΞΌs  β”Š GC (min … max):  0.00% … 96.16%
 Time  (median):     634.442 ns              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   831.999 ns Β±  1.448 ΞΌs  β”Š GC (mean Β± Οƒ):  20.81% Β± 11.23%

  β–ˆβ–„                                                           ▁
  β–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–„β–„β–ƒβ–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–ˆ β–ˆ
  553 ns        Histogram: log(frequency) by time      12.2 ΞΌs <

 Memory estimate: 4.69 KiB, allocs estimate: 100.

julia> @benchmark f($sum_but3, $x, $os)
BenchmarkTools.Trial: 10000 samples with 988 evaluations.
 Range (min … max):  48.591 ns … 93.220 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     48.770 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   49.312 ns Β±  2.936 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–             β–‚                                            ▁
  β–ˆβ–ˆβ–„β–‡β–ˆβ–†β–†β–…β–†β–†β–…β–†β–…β–†β–…β–ˆβ–‡β–†β–…β–†β–…β–„β–„β–„β–„β–…β–„β–…β–„β–…β–ƒβ–ƒβ–ƒβ–„β–†β–„β–β–„β–ƒβ–ƒβ–β–ƒβ–β–ƒβ–ƒβ–„β–„β–„β–„β–„β–„β–„β–…β–β–„β–ƒβ–ƒβ–…β–… β–ˆ
  48.6 ns      Histogram: log(frequency) by time      68.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Don’t think I can do much better than version 3, but thought I would ask here.

For numerical arrays, and assuming that the indices are always within bounds, the following seems comparable to version 3, but is easier to read:

julia> function sum_but4(x::SVector, o)
           sum(@inbounds Base.setindex(x, zero(eltype(x)), o))
       end
sum_but4 (generic function with 1 method)

julia> @benchmark f($sum_but3, $x, $os)
BenchmarkTools.Trial: 10000 samples with 963 evaluations.
 Range (min … max):  84.299 ns … 124.819 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     84.543 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   84.659 ns Β±   1.224 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

     β–‚β–ˆβ–ˆβ–„                                                       
  β–‚β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–†β–ƒβ–‚β–‚β–‚β–β–β–β–β–β–β–β–β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  84.3 ns         Histogram: frequency by time         87.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($sum_but4, $x, $os)
BenchmarkTools.Trial: 10000 samples with 967 evaluations.
 Range (min … max):  81.271 ns … 105.489 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     81.577 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   81.659 ns Β±   0.623 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–„β–ˆβ–†β–‚                                                     
  β–‚β–‚β–ƒβ–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  81.3 ns         Histogram: frequency by time         83.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
1 Like

Another quite explicit version that has the same performance on my machine:

function sum_but5(x::SVector{N,T}, o) where {N,T}
     s = zero(T)
     for i in 1:N
             s += ifelse(i == o, zero(T), x[i])
     end
     return s
end

Inspecting the assembly, I would have expected the compiler to emit conditional moves but it didn’t. Perhaps someone more knowledgable can comment on this :slight_smile:

2 Likes

Remains performant and works for arbitrary array types, eltypes and aggregations – even without neutral zero element:

julia> using Accessors
julia> sum_but6(x, o) = sum(@delete x[o])
julia> @btime f($sum_but6, $x, $os)
  85.970 ns (0 allocations: 0 bytes)
3 Likes

In my machine sum_but1 and sum_but3 perform the same, and sum_but5 is faster. (I’m using 1.10).

1 Like

For larger N:

N=500
x = randn(SVector{N})
os = rand(1:N, 100);

sum_bu1 is fastest:

julia> @benchmark f($sum_but1, $x, $os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  35.500 ΞΌs … 137.100 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     35.500 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   36.164 ΞΌs Β±   4.234 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ ▁                                                          ▁
  β–ˆβ–β–ˆβ–ƒβ–β–‡β–„β–ƒβ–„β–β–†β–†β–…β–‡β–ˆβ–ƒβ–…β–„β–ƒβ–ƒβ–„β–ƒβ–ƒβ–ƒβ–β–ƒβ–β–†β–†β–‡β–†β–…β–…β–„β–„β–„β–ƒβ–…β–…β–„β–„β–„β–„β–„β–ƒβ–β–ƒβ–β–ƒβ–β–β–ƒβ–ƒβ–„β–ƒβ–„β–„β–†β–…β–„ β–ˆ
  35.5 ΞΌs       Histogram: log(frequency) by time      56.5 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($sum_but3, $x, $os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  36.900 ΞΌs … 189.300 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     37.000 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   37.985 ΞΌs Β±   4.973 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ β–ƒ β–‚      ▁                                                 ▁
  β–ˆβ–β–ˆβ–…β–ˆβ–‡β–…β–…β–†β–†β–ˆβ–ˆβ–†β–„β–β–ƒβ–β–β–β–ƒβ–β–‡β–ˆβ–ˆβ–‡β–†β–†β–„β–β–†β–ƒβ–…β–„β–β–„β–„β–β–ƒβ–ƒβ–„β–β–…β–ƒβ–„β–‡β–‡β–†β–†β–ƒβ–„β–„β–„β–…β–…β–β–„β–„β–„β–β–„ β–ˆ
  36.9 ΞΌs       Histogram: log(frequency) by time      63.9 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($sum_but5, $x, $os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  38.300 ΞΌs … 139.400 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     38.400 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   39.184 ΞΌs Β±   4.843 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ   ▁                                                        ▁
  β–ˆβ–β–ˆβ–β–ˆβ–„β–β–ƒβ–„β–…β–†β–ƒβ–„β–β–„β–„β–„β–„β–…β–†β–„β–‡β–‡β–‡β–‡β–„β–…β–β–„β–„β–β–„β–„β–ƒβ–ƒβ–ƒβ–β–β–β–„β–„β–„β–„β–†β–†β–…β–„β–„β–„β–ƒβ–„β–„β–„β–„β–„β–ƒβ–„β–ƒβ–…β–„ β–ˆ
  38.3 ΞΌs       Histogram: log(frequency) by time      66.1 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Just saying because it’s not clear if your real problem is about such small vectors.
(Julia 1.9.3)

3 Likes

Yes, in 1.10 with these vectors they perform roughly the same for me.

1 Like

The array-language solution:

sum_but7(v::SVector, o) = sum(((i, x),) -> (i != o) .* x, enumerate(v))
# broadcasting might be easier to read, but allocates:
# sum_but8(v::SVector, o) = sum((eachindex(v) .!= o) .* v)

Seems comparable, but maybe a bit slower, than sum_but1 on my machine

2 Likes