Hi,

I have a simple operation that represents 80% of my runtime, and I am trying to make it as fast as possible. Here is a sketch of the problem:

```
using Statistics, BenchmarkTools

# Fill slack with exp(-D) and return its mean.
function exp_and_mean!(slack, D, N)
    slack .= exp.(-D)
    return mean(slack)
end

# Multiply slack by D in place and return its mean.
function prod_and_mean!(slack, D, N)
    slack .*= D
    return mean(slack)
end

function runtime_test!(rez, slack, D)
    n = length(rez)
    N = length(D)
    rez[1] = exp_and_mean!(slack, D, N)
    for i in 1:n-1
        rez[i+1] = prod_and_mean!(slack, D, N)
    end
    return rez
end

# Simple but typical use-case:
@benchmark runtime_test!(pars...) setup=(pars=(rez=zeros(20), slack=zeros(10000), D=exp.(randn(10000))))
```

Which outputs:

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   95.300 μs …   5.751 ms  ┊ GC (min … max): 0.00% … 97.30%
 Time  (median):     121.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   132.294 μs ± 134.750 μs  ┊ GC (mean ± σ):  3.53% ±  3.56%

 95.3 μs         Histogram: frequency by time          257 μs <

 Memory estimate: 78.20 KiB, allocs estimate: 2.
```
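As far as I can tell, the allocation comes from `exp.(-D)`: the unary minus is not dotted, so `-D` is materialised as a temporary array before the fused `exp.` runs (10000 `Float64` values is about 78 KiB, roughly consistent with the memory estimate). A minimal sketch of a fully fused variant, where `exp_and_mean_fused!` is a hypothetical name used only for illustration and is not the version benchmarked above:

```julia
using Statistics

# Hypothetical variant of exp_and_mean!: `@.` dots every call, including the
# unary minus, so the right-hand side fuses into a single in-place loop.
function exp_and_mean_fused!(slack, D)
    @. slack = exp(-D)
    return mean(slack)
end

D = exp.(randn(10_000))
slack = zeros(length(D))
m = exp_and_mean_fused!(slack, D)
```

This only removes the temporary; it does not by itself vectorise `exp`.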

The first thing I did was remove the allocations and vectorise the loops:

```
# Second version:
using LoopVectorization

# Same as before, but with an explicit loop that fills slack and accumulates the sum.
function exp_and_mean!(slack, D, N)
    zz = zero(eltype(D))
    @turbo for i in 1:N
        slack[i] = exp(-D[i])
        zz += slack[i]
    end
    zz /= N
    return zz
end

function prod_and_mean!(slack, D, N)
    zz = zero(eltype(D))
    @turbo for i in 1:N
        slack[i] *= D[i]
        zz += slack[i]
    end
    zz /= N
    return zz
end

@benchmark runtime_test!(pars...) setup=(pars=(rez=zeros(20), slack=zeros(10000), D=exp.(randn(10000))))
```

Which outputs:

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.500 μs … 164.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.453 μs ±   6.724 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 37.5 μs      Histogram: log(frequency) by time       75 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

This is a lot better: the median time went from 121.9 μs to 41.1 μs, so it was almost divided by 3.
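To make sure a rewrite of the kernels does not change the results, here is a rough sanity check. It relies on the fact that `rez[k+1]` is the mean of `exp(-D[i]) * D[i]^k`. The helper `reference` is a hypothetical name introduced only for this check, and the kernels shown are the broadcast ones from the first listing; the `@turbo` versions can be pasted in their place since the signatures are the same.

```julia
using Statistics

# Broadcast kernels from the first listing (N is unused, kept for the signature).
function exp_and_mean!(slack, D, N)
    slack .= exp.(-D)
    return mean(slack)
end

function prod_and_mean!(slack, D, N)
    slack .*= D
    return mean(slack)
end

function runtime_test!(rez, slack, D)
    n = length(rez)
    N = length(D)
    rez[1] = exp_and_mean!(slack, D, N)
    for i in 1:n-1
        rez[i+1] = prod_and_mean!(slack, D, N)
    end
    return rez
end

# Hypothetical helper: slow, direct evaluation of mean(exp(-D) * D^k), k = 0..n-1.
reference(D, n) = [mean(exp.(-D) .* D .^ k) for k in 0:n-1]

D = exp.(randn(10_000))
rez = runtime_test!(zeros(20), zeros(length(D)), D)

# Element-wise comparison with a loose relative tolerance, since the two
# computations round differently (repeated products versus powers).
agree = all(isapprox.(rez, reference(D, 20); rtol = 1e-8))
```

The tolerance `1e-8` is an arbitrary choice for this sketch, not a tuned bound.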

I was wondering whether more can be done, whether there are still problems in my code, or whether I am already fighting against the bare metal here. What do you think?