# Could this kernel go faster?

Hi,

I have the follwing problem :

``````using Statistics, BenchmarkTools, ProfileView

function compute_exp_moments!(mu,D,e,slack,data)
data .= D'e # most expensive line according to profileview.
slack .= exp.(.-data)
mu[1] = mean(slack)
for i in 2:length(mu)
slack .*= data
mu[i] = mean(slack)/mu[1]
end
end

function compute_many(N,n,d,How_much)
D = reshape(exp.(randn(d*N)),(d,N))
slack = zeros(N)
data = zeros(N)
mu = zeros(n)
for i in 1:How_much
e = rand(d)
compute_exp_moments!(mu,D,e,slack,data)
end
return mu
end

@btime compute_many(10000,10,20,100) # 20ms for me
ProfileView.@profview compute_many(10000,10,20,1000)
``````

The computation seems rather straightforward, but i have trouble making it faster. Much of the time is spend on the first matrix-vector product, as expected.

I tried Loopvectorisation together with StaticArrays and HybridArrays, improved by 20% :

``````function compute_exp_moments2!(mu,D,e,slack,data)
@turbo data .= D'e # Still takes much of the time.
@turbo slack .= exp.(.-data)
mu[1] = mean(slack)
for i in 2:length(mu) # I did not manage to use @turbo on this one.
slack .*= data
mu[i] = mean(slack)/mu[1]
end
end
function compute_many2(N,n,d,How_much)
D = HybridArray{Tuple{d,StaticArrays.Dynamic()}}(reshape(exp.(randn(d*N)),(d,N)))
slack = zeros(N)
data = zeros(N)
mu = zeros(n)
for i in 1:How_much
e = SVector{d}(rand(d))
compute_exp_moments2!(mu,D,e,slack,data)
end
return mu
end

@btime compute_many2(10000,10,20,100) #16 ms, better.
ProfileView.@profview compute_many2(10000,10,20,1000)
``````

But I am still looking for performance. This stripped-down example accounts for 75% of my actual runtime.

Is there anything more that I can do ?

You could should preallocate `e` and use `rand!(e)` in the loop. Also, you can use `mul!(data, D', e)` to avoid allocations for `D'e`.

``````using Statistics, BenchmarkTools, Random, LinearAlgebra

function compute_exp_moments!(mu, D, e, slack, data)
mul!(data, D', e)
slack .= exp.(.-data)
mu[1] = mean(slack)
@inbounds for i in 2:length(mu)
slack .*= data
mu[i] = mean(slack) / mu[1]
end
end

function compute_many(N, n, d, How_much)
D = reshape(exp.(randn(d * N)), (d, N))
slack = zeros(N)
data = zeros(N)
mu = zeros(n)
e = zeros(d)
for i in 1:How_much
rand!(e)
compute_exp_moments!(mu, D, e, slack, data)
end
return mu
end
``````

Timings:

``````# original: 22.240 ms (311 allocations: 10.86 MiB)
# prealloc e + rand!(e): 22.403 ms (212 allocations: 10.84 MiB)
# mul!(data, D', e): 16.343 ms (12 allocations: 3.20 MiB)
``````
2 Likes

Iām missing the usual reference to `using MKL` yet;)

1 Like

Thanks to all of you, this code is getting better and better. However, this post is messy and I have managed to pin out more precisely my bottleneck. Therefore I reposted it there : Is this the maximum perf i can obtain?

1 Like