Is this the maximum perf i can obtain?

Hi,

I have a simple operation to make, which represents 80% of my runtime. I am trying to make it as fast as possible… Here is the sketch of the problem :

using Statistics, BenchmarkTools
function exp_and_mean!(slack,D,N)
    slack .= exp.(-D)
    return mean(slack)
end
function prod_and_mean!(slack,D,N)
    slack .*= D
    return mean(slack)
end
function runtime_test!(rez,slack,D)
    n = length(rez)
    N = length(D)
    rez[1] = exp_and_mean!(slack,D,N)
    for i in 1:n-1
        rez[i+1] = prod_and_mean!(slack,D,N)
    end
    return rez
end

# Simple but typical use-case : 
@benchmark runtime_test!(pars...) setup=(pars=(rez=zeros(20), slack=zeros(10000), D=exp.(randn(10000))))

Which outputs :

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   95.300 ΞΌs …   5.751 ms  β”Š GC (min … max): 0.00% … 97.30%
 Time  (median):     121.900 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   132.294 ΞΌs Β± 134.750 ΞΌs  β”Š GC (mean Β± Οƒ):  3.53% Β±  3.56%

   β–‡       β–ˆβ–…    
  β–β–ˆβ–ˆβ–…β–…β–ƒβ–‚β–‚β–‚β–ˆβ–ˆβ–‡β–…β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  95.3 ΞΌs          Histogram: frequency by time          257 ΞΌs <

 Memory estimate: 78.20 KiB, allocs estimate: 2.

The first thing I have done was removing allocations and vectorising the loops :

# Second version : 
using LoopVectorization
function exp_and_mean!(slack,D,N)
    zz = zero(eltype(D))
    @turbo for i in 1:N
        slack[i] = exp(-D[i])
        zz += slack[i]
    end
    zz /= N
    return zz
end
function prod_and_mean!(slack,D,N)
    zz = zero(eltype(D))
    @turbo for i in 1:N
        slack[i] *= D[i]
        zz += slack[i]
    end
    zz /= N
    return zz
end
@benchmark runtime_test!(pars...) setup=(pars=(rez=zeros(20), slack=zeros(10000), D=exp.(randn(10000))))

Which outputs :

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.500 ΞΌs … 164.800 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     41.100 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   42.453 ΞΌs Β±   6.724 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ƒβ–‡β–ˆβ–ˆβ–†β–†β–‡β–‡β–‡β–†β–…β–„β–ƒβ–β–            ▁                                 β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–…β–†β–…β–‡β–‡β–†β–‡β–ˆβ–‡β–‡β–ˆβ–‡β–‡β–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–…β–…β–†β–…β–…β–…β–…β–…β–…β–ƒβ–…β–…β–„β–„β–…β–†β–…β–…β–…β–… β–ˆ
  37.5 ΞΌs       Histogram: log(frequency) by time        75 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This is a lot better, as the time was almost divided by 3.

I was wondering if we could do more, if I still have problems, or if I am fighting against the bare metal here. What do you think ?

This would be well suited for a GPU. Do you happen to have access to one?

I do have an AMD one on my machine, but i am not sure i want to go in this rabbithole… This code is part of my loss function, and gets evaluated billions of times. If i understood correclty, each time the data should be passed on to the GPU, and then passed back to the CPU for the rest of the code (which might not run on gpu) to be evaluated.

Disclaimer: I am on my phone so I can’t benchmark anything. To me the most straightforward optimization here is to do multithreading. Changing @turbo to @tturbo might work out of the box (you need to start Julia with multiple threads). Check the results to be safe, though.

This of course assumes you have multiple cores at your disposal.

Indeed, on my 4 cores laptop with just tturbo instead of turbo, i have now :

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.500 ΞΌs …  5.564 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     25.900 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   28.590 ΞΌs Β± 63.139 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‚β–‡β–‡β–ƒβ–„β–„β–‡β–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒβ–‚β–    ▁▁▁                                     β–ƒ
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–†β–‡β–†β–‡β–…β–†β–‡β–†β–†β–…β–†β–†β–…β–„β–„β–…β–†β–„β–„β–†β–„β–‚β–„β–„β–…β–„β–„β–ƒβ–ƒβ–„β–„ β–ˆ
  16.5 ΞΌs      Histogram: log(frequency) by time      84.8 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Which is ore than twice faster. Very neat. Is this the best thing we can do ?

Does @fastmath help for the exp() computations?

In facts, @turbo and @tturbo already implies @fastmath. I checked and the results seems correct.

1 Like

Just a thought: Is it just for the example, or are you actually calculating the first moment of distributions?

Then it might be possible to think of improving the sampling method, resulting in fewer points, using surrogates, or something more analytic like continuous or piece-wise linear integrals, only sampling the tails, …

1 Like

I am computing these moments for real, yes, but the data is here a fake data.

If you known the density, e.g. for a lognormal, I already have coded up a better integration procedure there : https://github.com/lrnv/ThorinDistributions.jl/blob/main/src/MFKProjection.jl

But here I only have the data, of unknown distribution :slight_smile:

1 Like

Hi,

Why do you need to allocate to slack (or even provide it as a variable) in the two methods?

 using LoopVectorization
function exp_and_mean!(D,N)
    zz = zero(eltype(D))
    @tturbo for i in 1:N        
        zz +=  exp(-D[i])  
    end
    zz /= N
    return zz
end
function prod_and_mean!(D,N)
    zz = one(eltype(D))
    @tturbo for i in 1:N
        zz *= D[i]
    end
    zz /= N
    return zz
end

is there also potentially an error in β€œprod_and_mean” ? do you mean to sum zz (I change to multiple above).

From your input setup vars, you are passing in slack as a vector of zeros so doesnt this yield zero for all cases in β€œprod_and_mean” ? Obviously they may be intended.

Do you have any opportunity to modify all β€œDs” before being passing into the functions at all or are they always generated before being called? You might benefit from creating the entire β€œD” matrix space in exp and not exp forms before executing your functions (you’d also be able to combine them if this is applicable).

Regards,

because I actually need the result for later computations. Look again at my runtime_test! function:

function runtime_test!(rez,slack,D)
    n = length(rez)
    N = length(D)
    rez[1] = exp_and_mean!(slack,D,N)
    for i in 1:n-1
        rez[i+1] = prod_and_mean!(slack,D,N)
    end
    return rez
end

You see that the slack is passed along from one function to the other. Actually, if you denote by X the rndom variable that corresponds to the data D, my function computes \mathbb E\left(X^k e^{-X}\right), for k \in \{0,...,n-1\}.

No there is no error in β€œprod_and_mean”, I indeed mean to sum zz. Yes this is intended, since slack is not expected to be zeros when calling the prod_and_mean function, look again at the runtime_test! function :slight_smile:

I do not understand what you mean. This computation is done only once for a given dataset D, I am afraid. The dataset D changes from one execution to the next, albeit not completely randomly: I compute it from another dataset data as follows:

data=.... #(sized (10,10000)
for i in 1:n_iterations 
    e = rand(10) # A new one is picked at each iteration 
    D = data'e # now a vector of size 10000
    runtime_test!(rez,slack,D)
    do_something_with!(rez)
end

data is fixed from one iteration to the other, but e is not (and therefore neither D, slack and rez). rez is needed for the rest of the computations, but D and slack are not.