# Sum an SVector skipping one element

I am optimizing a tight loop where one operation is summing an `SVector` except that it is required that an element is skipped. Its index is not known at compile time, it is obtained from a lookup table.

Some attempts:

``````using StaticArrays, BenchmarkTools

# ver1: cheating because float +/- is not associative, but easy to code
sum_but1(x::SVector, o) = sum(x) - x[o]

# ver2: try to zero it out
function sum_but2(x::SVector{N,T}, o) where {N,T}
xm = MVector(x)
xm[o] = zero(T)
sum(SVector(xm))
end

# ver3: skip in a sophisticated way
function sum_but3(x::SVector{N,T}, o) where {N,T}
_, s = foldl(x; init = (1, zero(T))) do (j, s), x
j + 1, (j == o ? s : s + x)::T
end
s
end

# benchmark with multiple o's, to represent the context better
function f(sum_f::F, x, os) where F
sum(o -> sum_f(x, o), os)
end
``````

Benchmarks:

``````julia> x = randn(SVector{5})
5-element SVector{5, Float64} with indices SOneTo(5):
-0.8928591641259742
-1.0148090290182
-0.2777069921829845
-0.49826579522478914
0.9804687828089862

julia> os = rand(1:5, 100);

julia> @benchmark f(\$sum_but1, \$x, \$os)
BenchmarkTools.Trial: 10000 samples with 966 evaluations.
Range (min β¦ max):  79.156 ns β¦ 122.954 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     80.699 ns               β GC (median):    0.00%
Time  (mean Β± Ο):   81.405 ns Β±   2.188 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββββ        ββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
79.2 ns         Histogram: frequency by time           91 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f(\$sum_but2, \$x, \$os)
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
Range (min β¦ max):  553.343 ns β¦ 23.668 ΞΌs  β GC (min β¦ max):  0.00% β¦ 96.16%
Time  (median):     634.442 ns              β GC (median):     0.00%
Time  (mean Β± Ο):   831.999 ns Β±  1.448 ΞΌs  β GC (mean Β± Ο):  20.81% Β± 11.23%

ββ                                                           β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
553 ns        Histogram: log(frequency) by time      12.2 ΞΌs <

Memory estimate: 4.69 KiB, allocs estimate: 100.

julia> @benchmark f(\$sum_but3, \$x, \$os)
BenchmarkTools.Trial: 10000 samples with 988 evaluations.
Range (min β¦ max):  48.591 ns β¦ 93.220 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     48.770 ns              β GC (median):    0.00%
Time  (mean Β± Ο):   49.312 ns Β±  2.936 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββ             β                                            β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
48.6 ns      Histogram: log(frequency) by time      68.6 ns <

Memory estimate: 0 bytes, allocs estimate: 0.
``````

Donβt think I can do much better than version 3, but thought I would ask here.

For numerical arrays, and assuming that the indices are always within bounds, the following seems comparable to version 3, but is easier to read:

``````julia> function sum_but4(x::SVector, o)
sum(@inbounds Base.setindex(x, zero(eltype(x)), o))
end
sum_but4 (generic function with 1 method)

julia> @benchmark f(\$sum_but3, \$x, \$os)
BenchmarkTools.Trial: 10000 samples with 963 evaluations.
Range (min β¦ max):  84.299 ns β¦ 124.819 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     84.543 ns               β GC (median):    0.00%
Time  (mean Β± Ο):   84.659 ns Β±   1.224 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
84.3 ns         Histogram: frequency by time         87.1 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f(\$sum_but4, \$x, \$os)
BenchmarkTools.Trial: 10000 samples with 967 evaluations.
Range (min β¦ max):  81.271 ns β¦ 105.489 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     81.577 ns               β GC (median):    0.00%
Time  (mean Β± Ο):   81.659 ns Β±   0.623 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
81.3 ns         Histogram: frequency by time         83.8 ns <

Memory estimate: 0 bytes, allocs estimate: 0.
``````
1 Like

Another quite explicit version that has the same performance on my machine:

``````function sum_but5(x::SVector{N,T}, o) where {N,T}
s = zero(T)
for i in 1:N
s += ifelse(i == o, zero(T), x[i])
end
return s
end
``````

Inspecting the assembly, I would have expected the compiler to emit conditional moves but it didnβt. Perhaps someone more knowledgable can comment on this

2 Likes

Remains performant and works for arbitrary array types, eltypes and aggregations β even without neutral `zero` element:

``````julia> using Accessors
julia> sum_but6(x, o) = sum(@delete x[o])
julia> @btime f(\$sum_but6, \$x, \$os)
85.970 ns (0 allocations: 0 bytes)
``````
3 Likes

In my machine `sum_but1` and `sum_but3` perform the same, and `sum_but5` is faster. (Iβm using 1.10).

1 Like

For larger N:

``````N=500
x = randn(SVector{N})
os = rand(1:N, 100);
``````

`sum_bu1` is fastest:

``````julia> @benchmark f(\$sum_but1, \$x, \$os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  35.500 ΞΌs β¦ 137.100 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     35.500 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   36.164 ΞΌs Β±   4.234 ΞΌs  β GC (mean Β± Ο):  0.00% Β± 0.00%

β β                                                          β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
35.5 ΞΌs       Histogram: log(frequency) by time      56.5 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f(\$sum_but3, \$x, \$os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  36.900 ΞΌs β¦ 189.300 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     37.000 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   37.985 ΞΌs Β±   4.973 ΞΌs  β GC (mean Β± Ο):  0.00% Β± 0.00%

β β β      β                                                 β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
36.9 ΞΌs       Histogram: log(frequency) by time      63.9 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f(\$sum_but5, \$x, \$os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max):  38.300 ΞΌs β¦ 139.400 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     38.400 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   39.184 ΞΌs Β±   4.843 ΞΌs  β GC (mean Β± Ο):  0.00% Β± 0.00%

β   β                                                        β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
38.3 ΞΌs       Histogram: log(frequency) by time      66.1 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.
``````

Just saying because itβs not clear if your real problem is about such small vectors.
(Julia 1.9.3)

3 Likes

Yes, in 1.10 with these vectors they perform roughly the same for me.

1 Like

The array-language solution:

``````sum_but7(v::SVector, o) = sum(((i, x),) -> (i != o) .* x, enumerate(v))
Seems comparable, but maybe a bit slower, than `sum_but1` on my machine