Hi,

I have the following problem:

```
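using Statistics, BenchmarkTools, ProfileView

# With x = D'e, fills mu[1] with mean(exp(-x)) and mu[i] (i >= 2) with
# mean(x^(i-1) * exp(-x)) / mu[1]. slack and data are preallocated work buffers.
function compute_exp_moments!(mu, D, e, slack, data)
    data .= D'e # most expensive line according to ProfileView.
    slack .= exp.(.-data)
    mu[1] = mean(slack)
    for i in 2:length(mu)
        slack .*= data
        mu[i] = mean(slack) / mu[1]
    end
end

# Driver: builds one random d×N matrix D, then calls the kernel
# How_much times, each time with a fresh random vector e.
function compute_many(N, n, d, How_much)
    D = reshape(exp.(randn(d * N)), (d, N))
    slack = zeros(N)
    data = zeros(N)
    mu = zeros(n)
    for i in 1:How_much
        e = rand(d)
        compute_exp_moments!(mu, D, e, slack, data)
    end
    return mu
end

@btime compute_many(10000, 10, 20, 100) # 20 ms for me
ProfileView.@profview compute_many(10000, 10, 20, 1000)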
```

The computation seems rather straightforward, but I have trouble making it faster. Much of the time is spent on the first matrix-vector product, as expected.
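
To look at that product in isolation, here is a minimal sketch that compares the broadcast form used above with an in-place `mul!` from `LinearAlgebra`. The helper names `product_broadcast!` and `product_inplace!` are made up for this illustration, and the sizes simply mirror the benchmark call above. I have not included timings, since they will depend on the machine and the BLAS in use.

```julia
using LinearAlgebra, BenchmarkTools

# Same shapes as in compute_many(10000, 10, 20, 100); illustrative only.
d, N = 20, 10_000
D = exp.(randn(d, N))
e = rand(d)
data = zeros(N)

# Broadcast form from the function above: D'e first allocates a temporary
# length-N vector, which is then copied into data.
# (Hypothetical helper name, for this comparison only.)
product_broadcast!(data, D, e) = (data .= D'e; data)

# In-place form: mul! writes D' * e directly into data, without a temporary.
# (Hypothetical helper name, for this comparison only.)
product_inplace!(data, D, e) = mul!(data, D', e)

# Both forms should agree up to floating-point rounding.
@assert product_broadcast!(similar(data), D, e) ≈ product_inplace!(similar(data), D, e)

@btime product_broadcast!($data, $D, $e)
@btime product_inplace!($data, $D, $e)
```

If the two timings are close, the cost is in the product itself rather than in the temporary allocation.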

I tried LoopVectorization together with StaticArrays and HybridArrays, which improved the runtime by about 20%:

```
using LoopVectorization, StaticArrays, HybridArrays

# Same kernel as compute_exp_moments!, with @turbo on the two broadcasts.
function compute_exp_moments2!(mu, D, e, slack, data)
    @turbo data .= D'e # Still takes much of the time.
    @turbo slack .= exp.(.-data)
    mu[1] = mean(slack)
    for i in 2:length(mu) # I did not manage to use @turbo on this one.
        slack .*= data
        mu[i] = mean(slack) / mu[1]
    end
end

# Same driver as compute_many, but D has a statically sized first
# dimension (HybridArray) and e is an SVector of length d.
function compute_many2(N, n, d, How_much)
    D = HybridArray{Tuple{d,StaticArrays.Dynamic()}}(reshape(exp.(randn(d * N)), (d, N)))
    slack = zeros(N)
    data = zeros(N)
    mu = zeros(n)
    for i in 1:How_much
        e = SVector{d}(rand(d))
        compute_exp_moments2!(mu, D, e, slack, data)
    end
    return mu
end

@btime compute_many2(10000, 10, 20, 100) # 16 ms, better.
ProfileView.@profview compute_many2(10000, 10, 20, 1000)
```

But I am still looking for more performance. This stripped-down example accounts for 75% of my actual runtime.

Is there anything more that I can do?