Could this kernel go faster?

Hi,

I have the follwing problem :

using Statistics, BenchmarkTools, ProfileView

function compute_exp_moments!(mu,D,e,slack,data)
    data .= D'e # most expensive line according to profileview. 
    slack .= exp.(.-data)
    mu[1] = mean(slack)
    for i in 2:length(mu)
        slack .*= data
        mu[i] = mean(slack)/mu[1]
    end
end

function compute_many(N,n,d,How_much)
    D = reshape(exp.(randn(d*N)),(d,N))
    slack = zeros(N)
    data = zeros(N)
    mu = zeros(n)
    for i in 1:How_much
        e = rand(d)
        compute_exp_moments!(mu,D,e,slack,data)
    end
    return mu
end

@btime compute_many(10000,10,20,100) # 20ms for me 
ProfileView.@profview compute_many(10000,10,20,1000)

The computation seems rather straightforward, but i have trouble making it faster. Much of the time is spend on the first matrix-vector product, as expected.

I tried Loopvectorisation together with StaticArrays and HybridArrays, improved by 20% :

function compute_exp_moments2!(mu,D,e,slack,data)
    @turbo data .= D'e # Still takes much of the time. 
    @turbo slack .= exp.(.-data)
    mu[1] = mean(slack)
    for i in 2:length(mu) # I did not manage to use @turbo on this one. 
        slack .*= data
        mu[i] = mean(slack)/mu[1]
    end
end
function compute_many2(N,n,d,How_much)
    D = HybridArray{Tuple{d,StaticArrays.Dynamic()}}(reshape(exp.(randn(d*N)),(d,N)))
    slack = zeros(N)
    data = zeros(N)
    mu = zeros(n)
    for i in 1:How_much
        e = SVector{d}(rand(d))
        compute_exp_moments2!(mu,D,e,slack,data)
    end
    return mu
end

@btime compute_many2(10000,10,20,100) #16 ms, better. 
ProfileView.@profview compute_many2(10000,10,20,1000)

But I am still looking for performance. This stripped-down example accounts for 75% of my actual runtime.

Is there anything more that I can do ?

You could should preallocate e and use rand!(e) in the loop. Also, you can use mul!(data, D', e) to avoid allocations for D'e.

using Statistics, BenchmarkTools, Random, LinearAlgebra

function compute_exp_moments!(mu, D, e, slack, data)
    mul!(data, D', e)
    slack .= exp.(.-data)
    mu[1] = mean(slack)
    @inbounds for i in 2:length(mu)
        slack .*= data
        mu[i] = mean(slack) / mu[1]
    end
end

function compute_many(N, n, d, How_much)
    D = reshape(exp.(randn(d * N)), (d, N))
    slack = zeros(N)
    data = zeros(N)
    mu = zeros(n)
    e = zeros(d)
    for i in 1:How_much
        rand!(e)
        compute_exp_moments!(mu, D, e, slack, data)
    end
    return mu
end

Timings:

# original: 22.240 ms (311 allocations: 10.86 MiB)
# prealloc e + rand!(e): 22.403 ms (212 allocations: 10.84 MiB)
# mul!(data, D', e): 16.343 ms (12 allocations: 3.20 MiB)
2 Likes

Iā€™m missing the usual reference to using MKL yet;)

1 Like

Thanks to all of you, this code is getting better and better. However, this post is messy and I have managed to pin out more precisely my bottleneck. Therefore I reposted it there : Is this the maximum perf i can obtain?

1 Like