Sum an SVector skipping one element

Tamas_Papp · November 30, 2023, 11:11am

I am optimizing a tight loop in which one operation sums an SVector while skipping one element. The index of that element is not known at compile time; it is obtained from a lookup table.

Some attempts:

using StaticArrays, BenchmarkTools

# ver1: cheating because float +/- is not associative, but easy to code
sum_but1(x::SVector, o) = sum(x) - x[o]

# ver2: try to zero it out 
function sum_but2(x::SVector{N,T}, o) where {N,T}
    xm = MVector(x)
    xm[o] = zero(T)
    sum(SVector(xm))
end

# ver3: skip in a sophisticated way
function sum_but3(x::SVector{N,T}, o) where {N,T}
    _, s = foldl(x; init = (1, zero(T))) do (j, s), x
        j + 1, (j == o ? s : s + x)::T
    end
    s
end

# benchmark with multiple o's, to represent the context better
function f(sum_f::F, x, os) where F
    sum(o -> sum_f(x, o), os)
end
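To make the "cheating" remark on ver1 concrete, here is a small sketch. The input values are chosen for illustration and are not from the benchmark. Subtracting the skipped element afterwards can lose low-order digits, and it yields NaN when the skipped element is infinite, whereas skipping the element during accumulation avoids both.

```julia
using StaticArrays

sum_but1(x::SVector, o) = sum(x) - x[o]

# ver3 from above, with the inner variable renamed to xi for readability
function sum_but3(x::SVector{N,T}, o) where {N,T}
    _, s = foldl(x; init = (1, zero(T))) do (j, s), xi
        j + 1, (j == o ? s : s + xi)::T
    end
    s
end

# Rounding: 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost before subtracting.
a = SVector(1e16, 1.0)
sum_but1(a, 1)   # 0.0
sum_but3(a, 1)   # 1.0

# Non-finite input: Inf - Inf is NaN.
b = SVector(1.0, Inf, 2.0)
sum_but1(b, 2)   # NaN
sum_but3(b, 2)   # 3.0
```

Whether this matters depends on the data; for well-scaled finite inputs the difference is typically a few ulps.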

Benchmarks:

julia> x = randn(SVector{5})
5-element SVector{5, Float64} with indices SOneTo(5):
 -0.8928591641259742
 -1.0148090290182
 -0.2777069921829845
 -0.49826579522478914
  0.9804687828089862

julia> os = rand(1:5, 100);

julia> @benchmark f($sum_but1, $x, $os)
BenchmarkTools.Trial: 10000 samples with 966 evaluations.
 Range (min … max):  79.156 ns … 122.954 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     80.699 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.405 ns ±   2.188 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂█▅▁        ▄█                                           
  ▅▄▂▃▆████▄▁▁▁▁▃▃▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  79.2 ns         Histogram: frequency by time           91 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($sum_but2, $x, $os)
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
 Range (min … max):  553.343 ns … 23.668 μs  ┊ GC (min … max):  0.00% … 96.16%
 Time  (median):     634.442 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   831.999 ns ±  1.448 μs  ┊ GC (mean ± σ):  20.81% ± 11.23%

  █▄                                                           ▁
  ███▆▄▃▄▄▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃█ █
  553 ns        Histogram: log(frequency) by time      12.2 μs <

 Memory estimate: 4.69 KiB, allocs estimate: 100.

julia> @benchmark f($sum_but3, $x, $os)
BenchmarkTools.Trial: 10000 samples with 988 evaluations.
 Range (min … max):  48.591 ns … 93.220 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     48.770 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.312 ns ±  2.936 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁             ▂                                            ▁
  ██▄▇█▆▆▅▆▆▅▆▅▆▅█▇▆▅▆▅▄▄▄▄▅▄▅▄▅▃▃▃▄▆▄▁▄▃▃▁▃▁▃▃▄▄▄▄▄▄▄▅▁▄▃▃▅▅ █
  48.6 ns      Histogram: log(frequency) by time      68.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Don’t think I can do much better than version 3, but thought I would ask here.

jishnub · November 30, 2023, 11:25am

For numerical arrays, and assuming that the indices are always within bounds, the following seems comparable to version 3, but is easier to read:

julia> function sum_but4(x::SVector, o)
           sum(@inbounds Base.setindex(x, zero(eltype(x)), o))
       end
sum_but4 (generic function with 1 method)

julia> @benchmark f($sum_but3, $x, $os)
BenchmarkTools.Trial: 10000 samples with 963 evaluations.
 Range (min … max):  84.299 ns … 124.819 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     84.543 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   84.659 ns ±   1.224 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂██▄                                                       
  ▂▂▄████▆▃▂▂▂▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  84.3 ns         Histogram: frequency by time         87.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($sum_but4, $x, $os)
BenchmarkTools.Trial: 10000 samples with 967 evaluations.
 Range (min … max):  81.271 ns … 105.489 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     81.577 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.659 ns ±   0.623 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▄█▆▂                                                     
  ▂▂▃▄███████▆▄▃▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  81.3 ns         Histogram: frequency by time         83.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
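A minimal sketch of what sum_but4 relies on, with illustrative values: for an SVector, Base.setindex returns a new vector with one entry replaced and leaves the original untouched, so no MVector round trip is needed. Since @inbounds removes the bounds check, o must be a valid index; the name checked_sum_but below is made up for this example and keeps the check.

```julia
using StaticArrays

x = SVector(1.0, 2.0, 3.0, 4.0)

# Non-mutating: returns a new SVector with entry 2 replaced by zero.
y = Base.setindex(x, 0.0, 2)

x        # unchanged: [1.0, 2.0, 3.0, 4.0]
y        # [1.0, 0.0, 3.0, 4.0]
sum(y)   # 8.0

# Hypothetical checked variant: without @inbounds, an invalid index
# raises an error instead of being undefined behavior.
checked_sum_but(x::SVector, o) = sum(Base.setindex(x, zero(eltype(x)), o))
```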

abraemer · November 30, 2023, 12:00pm

Another quite explicit version that has the same performance on my machine:

function sum_but5(x::SVector{N,T}, o) where {N,T}
    s = zero(T)
    for i in 1:N
        s += ifelse(i == o, zero(T), x[i])
    end
    return s
end

Inspecting the assembly, I would have expected the compiler to emit conditional moves, but it didn’t. Perhaps someone more knowledgeable can comment on this.
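For anyone who wants to repeat the inspection, here is a sketch using the standard reflection macros from InteractiveUtils (loaded automatically in the REPL). What the output contains depends on the Julia version and the CPU, so no expected output is shown.

```julia
using StaticArrays, InteractiveUtils

function sum_but5(x::SVector{N,T}, o) where {N,T}
    s = zero(T)
    for i in 1:N
        s += ifelse(i == o, zero(T), x[i])
    end
    return s
end

x = randn(SVector{5})

# LLVM IR, usually easier to read than assembly
@code_llvm debuginfo=:none sum_but5(x, 3)

# Native assembly for the same call
@code_native debuginfo=:none sum_but5(x, 3)
```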

aplavin · November 30, 2023, 12:01pm

Remains performant and works for arbitrary array types, eltypes and aggregations, even those without a neutral (zero) element:

julia> using Accessors
julia> sum_but6(x, o) = sum(@delete x[o])
julia> @btime f($sum_but6, $x, $os)
  85.970 ns (0 allocations: 0 bytes)
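A sketch of the "other aggregations" point, with illustrative values. For a product, zeroing the skipped entry would give the wrong answer, and dividing it out afterwards fails when the entry is zero, but deleting the element works unchanged. The names prod_but and max_but are made up for this example.

```julia
using StaticArrays, Accessors

# Hypothetical helpers following the same pattern as sum_but6
prod_but(x, o) = prod(@delete x[o])
max_but(x, o) = maximum(@delete x[o])

x = SVector(2.0, 0.0, 5.0)

prod_but(x, 2)      # 10.0
prod(x) / x[2]      # NaN, since this is 0.0 / 0.0
max_but(x, 3)       # 2.0
```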

lmiq · November 30, 2023, 12:17pm

On my machine sum_but1 and sum_but3 perform the same, and sum_but5 is faster. (I’m using Julia 1.10.)

oheil · November 30, 2023, 1:05pm

For larger N:

N=500
x = randn(SVector{N})
os = rand(1:N, 100);

sum_but1 is fastest:

julia> @benchmark f($sum_but1, $x, $os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  35.500 μs … 137.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.164 μs ±   4.234 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▁                                                          ▁
  █▁█▃▁▇▄▃▄▁▆▆▅▇█▃▅▄▃▃▄▃▃▃▁▃▁▆▆▇▆▅▅▄▄▄▃▅▅▄▄▄▄▄▃▁▃▁▃▁▁▃▃▄▃▄▄▆▅▄ █
  35.5 μs       Histogram: log(frequency) by time      56.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($sum_but3, $x, $os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  36.900 μs … 189.300 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.985 μs ±   4.973 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▃ ▂      ▁                                                 ▁
  █▁█▅█▇▅▅▆▆██▆▄▁▃▁▁▁▃▁▇██▇▆▆▄▁▆▃▅▄▁▄▄▁▃▃▄▁▅▃▄▇▇▆▆▃▄▄▄▅▅▁▄▄▄▁▄ █
  36.9 μs       Histogram: log(frequency) by time      63.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($sum_but5, $x, $os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  38.300 μs … 139.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   39.184 μs ±   4.843 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▁                                                        ▁
  █▁█▁█▄▁▃▄▅▆▃▄▁▄▄▄▄▅▆▄▇▇▇▇▄▅▁▄▄▁▄▄▃▃▃▁▁▁▄▄▄▄▆▆▅▄▄▄▃▄▄▄▄▄▃▄▃▅▄ █
  38.3 μs       Histogram: log(frequency) by time      66.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Just saying because it’s not clear if your real problem is about such small vectors.
(Julia 1.9.3)
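Since the ranking seems to depend on both N and the Julia version, a sketch for sweeping a few sizes on your own machine might look like the following. The chosen sizes are arbitrary, large SVectors can take a while to compile, and no timings are shown because they vary between setups.

```julia
using StaticArrays, BenchmarkTools

sum_but1(x::SVector, o) = sum(x) - x[o]

function sum_but5(x::SVector{N,T}, o) where {N,T}
    s = zero(T)
    for i in 1:N
        s += ifelse(i == o, zero(T), x[i])
    end
    return s
end

# Same harness as in the original post
f(sum_f::F, x, os) where {F} = sum(o -> sum_f(x, o), os)

for N in (5, 50, 500)
    x = randn(SVector{N})
    os = rand(1:N, 100)
    # @belapsed returns the minimum elapsed time in seconds
    t1 = @belapsed f($sum_but1, $x, $os)
    t5 = @belapsed f($sum_but5, $x, $os)
    println("N = ", N, ": sum_but1 ", t1, " s, sum_but5 ", t5, " s")
end
```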

lmiq · November 30, 2023, 2:46pm

Yes, in 1.10 with these vectors they perform roughly the same for me.

bertschi · November 30, 2023, 6:12pm

The array-language solution:

sum_but7(v::SVector, o) = sum(((i, x),) -> (i != o) .* x, enumerate(v))
# broadcasting might be easier to read, but allocates:
# sum_but8(v::SVector, o) = sum((eachindex(v) .!= o) .* v)

Seems comparable to, but maybe a bit slower than, sum_but1 on my machine.
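One property of this formulation worth noting: in Julia, false acts as a "strong zero" in multiplication, so (i != o) .* x yields zero for the skipped element even when that element is NaN or Inf. A small sketch with illustrative values:

```julia
using StaticArrays

sum_but1(x::SVector, o) = sum(x) - x[o]
sum_but7(v::SVector, o) = sum(((i, x),) -> (i != o) .* x, enumerate(v))

false * NaN            # 0.0
false * Inf            # 0.0

v = SVector(1.0, NaN, 2.0)
sum_but7(v, 2)         # 3.0
sum_but1(v, 2)         # NaN, because NaN - NaN is NaN
```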
