Hey,
I have a computation that we already discussed there and which still takes up more than half of my total runtime (so days…).
I did found a very clunky way to cut in half it’s runtime by manually unrolling a loop, which allowed to remove some allocations. Unfortunately, this only works for a given value of M
, as i have to write the code by hand (writting the loop will prevent @tturbo
to work since its order do matter). Maybe there is a way to do it programatically for all M
?
Here is the code :
using StaticArrays, LoopVectorization, BenchmarkTools
function exp_and_mean!(slack,D,N)
zz = zero(eltype(D))
@tturbo for i in 1:N
slack[i] = exp(-D[i])
zz += slack[i]
end
zz /= N
return zz
end
function prod_and_mean!(slack,D,N)
zz = zero(eltype(D))
@tturbo for i in 1:N
slack[i] *= D[i]
zz += slack[i]
end
zz /= N
return zz
end
function sensitive_function!(rez,slack,D)
M = length(rez)
N = length(D)
rez[1] = exp_and_mean!(slack,D,N)
for k in 2:M
rez[k] = prod_and_mean!(slack,D,N)
end
return rez
end
function v2_only_when_M_is_10(rez,D)
M = length(rez)
@assert M == 10
N = length(D)
r₁ = r₂ = r₃ = r₄ = r₅ = r₆ = r₇ = r₈ = r₉ = r₁₀ = 0.0
@tturbo for i in 1:N
Dᵢ = D[i]
eᵢ = exp(-Dᵢ)
r₁ += eᵢ
eᵢ *= Dᵢ
r₂ += eᵢ
eᵢ *= Dᵢ
r₃ += eᵢ
eᵢ *= Dᵢ
r₄ += eᵢ
eᵢ *= Dᵢ
r₅ += eᵢ
eᵢ *= Dᵢ
r₆ += eᵢ
eᵢ *= Dᵢ
r₇ += eᵢ
eᵢ *= Dᵢ
r₈ += eᵢ
eᵢ *= Dᵢ
r₉ += eᵢ
eᵢ *= Dᵢ
r₁₀+= eᵢ
end
rez .= [r₁,r₂,r₃,r₄,r₅,r₆,r₇,r₈,r₉,r₁₀] ./ N
return nothing
end
# Simple but typical use-case :
M = 10
N = 100000
D = exp.(randn(N))
slack = zero(D)
rez = MVector{M,Float64}(zeros(M))
sensitive_function!(rez,slack,D)
rez2 = MVector{M,Float64}(zeros(M))
v2_only_when_M_is_10(rez2,D)
@assert all(rez .≈ rez2)
b = @benchmark sensitive_function!($rez,$slack,$D);
display(b)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 152.000 μs … 1.111 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 296.400 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 260.313 μs ± 62.834 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▃▇▁▃▃▁ ▃▄▃▂▁▁▁ ▅▄▃▄█▇▆▄▂▁ ▂
# ▃██████▆▆▅▆▆▅▇████████▇█▆▆▅▆▅▄▅▂▄▄▄▂▃▄██████████████▇▇▇▅▆▆▆▄ █
# 152 μs Histogram: log(frequency) by time 358 μs <
#
# Memory estimate: 0 bytes, allocs estimate: 0.
b2 = @benchmark v2_only_when_M_is_10($rez,$D)
display(b2)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
# Range (min … max): 83.300 μs … 22.182 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 130.000 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 132.251 μs ± 283.493 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▃ ▅ ▂▄▃▁▅▄ ▅█▃▂▅▃▁▁ ▁ ▁
# █▆█▇▇▅▇▆▄▆▆▅▄▃▃▅▇▇███████████████████▇▇█▅▆▆▆▅▅▅▆▆▅▅▂▂▅▄▅▄▄▄▄█ █
# 83.3 μs Histogram: log(frequency) by time 184 μs <
# Memory estimate: 160 bytes, allocs estimate: 1.
# How to contruct a function like the version 2 so that it works for all M ?
# Maybe a macro that will unroll by hand the loop and produce many functions ?
I would like to construct methods for sensitive_function!
that would do the same thing, with the same performance gains, but for any size M
of the (static) vector rez
. Is that possible ? I was thinking about a macro and code generation, but I never wrote one so I come for help.
What would be perfect is a macro that would produce methods like:
sensitive_function!(rez::MVector{1,Float64},D)
sensitive_function!(rez::MVector{2,Float64},D)
sensitive_function!(rez::MVector{3,Float64},D)
sensitive_function!(rez::MVector{4,Float64},D)
...
until a given maximum, after which the default version would work on the rest. Maybe this hack is not viable for too big values of M, when the other order of loops is still relevant, in which case we’ll fall back to the previous version.