I am optimizing a tight loop where one operation is summing an SVector
except that it is required that an element is skipped. Its index is not known at compile time, it is obtained from a lookup table.
Some attempts:
using StaticArrays, BenchmarkTools
# ver1: cheating because float +/- is not associative, but easy to code
sum_but1(x::SVector, o) = sum(x) - x[o]
# ver2: try to zero it out
function sum_but2(x::SVector{N,T}, o) where {N,T}
xm = MVector(x)
xm[o] = zero(T)
sum(SVector(xm))
end
# ver3: skip in a sophisticated way
function sum_but3(x::SVector{N,T}, o) where {N,T}
_, s = foldl(x; init = (1, zero(T))) do (j, s), x
j + 1, (j == o ? s : s + x)::T
end
s
end
# benchmark with multiple o's, to represent the context better
function f(sum_f::F, x, os) where F
sum(o -> sum_f(x, o), os)
end
Benchmarks:
julia> x = randn(SVector{5})
5-element SVector{5, Float64} with indices SOneTo(5):
-0.8928591641259742
-1.0148090290182
-0.2777069921829845
-0.49826579522478914
0.9804687828089862
julia> os = rand(1:5, 100);
julia> @benchmark f($sum_but1, $x, $os)
BenchmarkTools.Trial: 10000 samples with 966 evaluations.
Range (min β¦ max): 79.156 ns β¦ 122.954 ns β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 80.699 ns β GC (median): 0.00%
Time (mean Β± Ο): 81.405 ns Β± 2.188 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
βββ
β ββ
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
79.2 ns Histogram: frequency by time 91 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark f($sum_but2, $x, $os)
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
Range (min β¦ max): 553.343 ns β¦ 23.668 ΞΌs β GC (min β¦ max): 0.00% β¦ 96.16%
Time (median): 634.442 ns β GC (median): 0.00%
Time (mean Β± Ο): 831.999 ns Β± 1.448 ΞΌs β GC (mean Β± Ο): 20.81% Β± 11.23%
ββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
553 ns Histogram: log(frequency) by time 12.2 ΞΌs <
Memory estimate: 4.69 KiB, allocs estimate: 100.
julia> @benchmark f($sum_but3, $x, $os)
BenchmarkTools.Trial: 10000 samples with 988 evaluations.
Range (min β¦ max): 48.591 ns β¦ 93.220 ns β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 48.770 ns β GC (median): 0.00%
Time (mean Β± Ο): 49.312 ns Β± 2.936 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
ββ β β
ββββββββ
βββ
ββ
ββ
ββββ
ββ
βββββ
ββ
ββ
βββββββββββββββββββββββ
βββββ
β
β
48.6 ns Histogram: log(frequency) by time 68.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Donβt think I can do much better than version 3, but thought I would ask here.
jishnub
November 30, 2023, 11:25am
2
For numerical arrays, and assuming that the indices are always within bounds, the following seems comparable to version 3, but is easier to read:
julia> function sum_but4(x::SVector, o)
sum(@inbounds Base.setindex(x, zero(eltype(x)), o))
end
sum_but4 (generic function with 1 method)
julia> @benchmark f($sum_but3, $x, $os)
BenchmarkTools.Trial: 10000 samples with 963 evaluations.
Range (min β¦ max): 84.299 ns β¦ 124.819 ns β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 84.543 ns β GC (median): 0.00%
Time (mean Β± Ο): 84.659 ns Β± 1.224 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
84.3 ns Histogram: frequency by time 87.1 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark f($sum_but4, $x, $os)
BenchmarkTools.Trial: 10000 samples with 967 evaluations.
Range (min β¦ max): 81.271 ns β¦ 105.489 ns β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 81.577 ns β GC (median): 0.00%
Time (mean Β± Ο): 81.659 ns Β± 0.623 ns β GC (mean Β± Ο): 0.00% Β± 0.00%
ββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
81.3 ns Histogram: frequency by time 83.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
1 Like
Another quite explicit version that has the same performance on my machine:
function sum_but5(x::SVector{N,T}, o) where {N,T}
s = zero(T)
for i in 1:N
s += ifelse(i == o, zero(T), x[i])
end
return s
end
Inspecting the assembly, I would have expected the compiler to emit conditional moves but it didnβt. Perhaps someone more knowledgable can comment on this
2 Likes
aplavin
November 30, 2023, 12:01pm
4
Remains performant and works for arbitrary array types, eltypes and aggregations β even without neutral zero
element:
julia> using Accessors
julia> sum_but6(x, o) = sum(@delete x[o])
julia> @btime f($sum_but6, $x, $os)
85.970 ns (0 allocations: 0 bytes)
3 Likes
lmiq
November 30, 2023, 12:17pm
5
In my machine sum_but1
and sum_but3
perform the same, and sum_but5
is faster. (Iβm using 1.10).
1 Like
oheil
November 30, 2023, 1:05pm
7
For larger N:
N=500
x = randn(SVector{N})
os = rand(1:N, 100);
sum_bu1
is fastest:
julia> @benchmark f($sum_but1, $x, $os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 35.500 ΞΌs β¦ 137.100 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 35.500 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 36.164 ΞΌs Β± 4.234 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
β β β
βββββββββββββ
ββββ
βββββββββββββββ
β
βββββ
β
ββββββββββββββββββββ
β β
35.5 ΞΌs Histogram: log(frequency) by time 56.5 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark f($sum_but3, $x, $os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 36.900 ΞΌs β¦ 189.300 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 37.000 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 37.985 ΞΌs Β± 4.973 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
β β β β β
ββββ
βββ
β
ββββββββββββββββββββββββ
ββββββββββ
βββββββββββ
β
ββββββ β
36.9 ΞΌs Histogram: log(frequency) by time 63.9 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark f($sum_but5, $x, $os)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min β¦ max): 38.300 ΞΌs β¦ 139.400 ΞΌs β GC (min β¦ max): 0.00% β¦ 0.00%
Time (median): 38.400 ΞΌs β GC (median): 0.00%
Time (mean Β± Ο): 39.184 ΞΌs Β± 4.843 ΞΌs β GC (mean Β± Ο): 0.00% Β± 0.00%
β β β
ββββββββββ
βββββββββ
ββββββββ
βββββββββββββββββββ
βββββββββββββ
β β
38.3 ΞΌs Histogram: log(frequency) by time 66.1 ΞΌs <
Memory estimate: 0 bytes, allocs estimate: 0.
Just saying because itβs not clear if your real problem is about such small vectors.
(Julia 1.9.3)
3 Likes
lmiq
November 30, 2023, 2:46pm
8
Yes, in 1.10 with these vectors they perform roughly the same for me.
1 Like
The array-language solution:
sum_but7(v::SVector, o) = sum(((i, x),) -> (i != o) .* x, enumerate(v))
# broadcasting might be easier to read, but allocates:
# sum_but8(v::SVector, o) = sum((eachindex(v) .!= o) .* v)
Seems comparable, but maybe a bit slower, than sum_but1
on my machine
2 Likes