Is this the maximum perf i can obtain?

lrnv · January 4, 2022, 9:50am

Hi,

I have a simple operation to make, which represents 80% of my runtime. I am trying to make it as fast as possible… Here is the sketch of the problem :

using Statistics, BenchmarkTools
function exp_and_mean!(slack,D,N)
    slack .= exp.(-D)
    return mean(slack)
end
function prod_and_mean!(slack,D,N)
    slack .*= D
    return mean(slack)
end
function runtime_test!(rez,slack,D)
    n = length(rez)
    N = length(D)
    rez[1] = exp_and_mean!(slack,D,N)
    for i in 1:n-1
        rez[i+1] = prod_and_mean!(slack,D,N)
    end
    return rez
end

# Simple but typical use-case : 
@benchmark runtime_test!(pars...) setup=(pars=(rez=zeros(20), slack=zeros(10000), D=exp.(randn(10000))))

Which outputs :

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   95.300 μs …   5.751 ms  ┊ GC (min … max): 0.00% … 97.30%
 Time  (median):     121.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   132.294 μs ± 134.750 μs  ┊ GC (mean ± σ):  3.53% ±  3.56%

   ▇       █▅    
  ▁██▅▅▃▂▂▂██▇▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  95.3 μs          Histogram: frequency by time          257 μs <

 Memory estimate: 78.20 KiB, allocs estimate: 2.

The first thing I have done was removing allocations and vectorising the loops :

# Second version : 
using LoopVectorization
function exp_and_mean!(slack,D,N)
    zz = zero(eltype(D))
    @turbo for i in 1:N
        slack[i] = exp(-D[i])
        zz += slack[i]
    end
    zz /= N
    return zz
end
function prod_and_mean!(slack,D,N)
    zz = zero(eltype(D))
    @turbo for i in 1:N
        slack[i] *= D[i]
        zz += slack[i]
    end
    zz /= N
    return zz
end
@benchmark runtime_test!(pars...) setup=(pars=(rez=zeros(20), slack=zeros(10000), D=exp.(randn(10000))))

Which outputs :

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.500 μs … 164.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.453 μs ±   6.724 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▇██▆▆▇▇▇▆▅▄▃▁▁            ▁                                 ▂
  ████████████████▇▇▆▆▅▆▅▇▇▆▇█▇▇█▇▇▇█▇▇▇▇▇▅▅▆▅▅▅▅▅▅▃▅▅▄▄▅▆▅▅▅▅ █
  37.5 μs       Histogram: log(frequency) by time        75 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This is a lot better, as the time was almost divided by 3.

I was wondering if we could do more, if I still have problems, or if I am fighting against the bare metal here. What do you think ?

Samuel3008 · January 4, 2022, 10:27am

This would be well suited for a GPU. Do you happen to have access to one?

lrnv · January 4, 2022, 10:30am

I do have an AMD one on my machine, but i am not sure i want to go in this rabbithole… This code is part of my loss function, and gets evaluated billions of times. If i understood correclty, each time the data should be passed on to the GPU, and then passed back to the CPU for the rest of the code (which might not run on gpu) to be evaluated.

IlianPihlajamaa · January 4, 2022, 11:12am

Disclaimer: I am on my phone so I can’t benchmark anything. To me the most straightforward optimization here is to do multithreading. Changing @turbo to @tturbo might work out of the box (you need to start Julia with multiple threads). Check the results to be safe, though.

This of course assumes you have multiple cores at your disposal.

lrnv · January 4, 2022, 11:55am

Indeed, on my 4 cores laptop with just tturbo instead of turbo, i have now :

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.500 μs …  5.564 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.900 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.590 μs ± 63.139 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▇▇▃▄▄▇███▇▅▄▃▂▁    ▁▁▁                                     ▃
  ████████████████████████▇█▇▇▆▇▆▇▅▆▇▆▆▅▆▆▅▄▄▅▆▄▄▆▄▂▄▄▅▄▄▃▃▄▄ █
  16.5 μs      Histogram: log(frequency) by time      84.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Which is ore than twice faster. Very neat. Is this the best thing we can do ?

rafael.guerra · January 4, 2022, 2:19pm

Does @fastmath help for the exp() computations?

lrnv · January 4, 2022, 3:07pm

In facts, @turbo and @tturbo already implies @fastmath. I checked and the results seems correct.

Bardo · January 4, 2022, 6:04pm

Just a thought: Is it just for the example, or are you actually calculating the first moment of distributions?

Then it might be possible to think of improving the sampling method, resulting in fewer points, using surrogates, or something more analytic like continuous or piece-wise linear integrals, only sampling the tails, …

lrnv · January 4, 2022, 6:27pm

I am computing these moments for real, yes, but the data is here a fake data.

If you known the density, e.g. for a lognormal, I already have coded up a better integration procedure there : https://github.com/lrnv/ThorinDistributions.jl/blob/main/src/MFKProjection.jl

But here I only have the data, of unknown distribution

djholiver · January 4, 2022, 8:24pm

Hi,

Why do you need to allocate to slack (or even provide it as a variable) in the two methods?

 using LoopVectorization
function exp_and_mean!(D,N)
    zz = zero(eltype(D))
    @tturbo for i in 1:N        
        zz +=  exp(-D[i])  
    end
    zz /= N
    return zz
end
function prod_and_mean!(D,N)
    zz = one(eltype(D))
    @tturbo for i in 1:N
        zz *= D[i]
    end
    zz /= N
    return zz
end

is there also potentially an error in “prod_and_mean” ? do you mean to sum zz (I change to multiple above).

From your input setup vars, you are passing in slack as a vector of zeros so doesnt this yield zero for all cases in “prod_and_mean” ? Obviously they may be intended.

Do you have any opportunity to modify all “Ds” before being passing into the functions at all or are they always generated before being called? You might benefit from creating the entire “D” matrix space in exp and not exp forms before executing your functions (you’d also be able to combine them if this is applicable).

Regards,

lrnv · January 5, 2022, 9:55am

because I actually need the result for later computations. Look again at my runtime_test! function:

function runtime_test!(rez,slack,D)
    n = length(rez)
    N = length(D)
    rez[1] = exp_and_mean!(slack,D,N)
    for i in 1:n-1
        rez[i+1] = prod_and_mean!(slack,D,N)
    end
    return rez
end

You see that the slack is passed along from one function to the other. Actually, if you denote by X the rndom variable that corresponds to the data D, my function computes \mathbb E\left(X^k e^{-X}\right), for k \in \{0,...,n-1\}.

No there is no error in “prod_and_mean”, I indeed mean to sum zz. Yes this is intended, since slack is not expected to be zeros when calling the prod_and_mean function, look again at the runtime_test! function

I do not understand what you mean. This computation is done only once for a given dataset D, I am afraid. The dataset D changes from one execution to the next, albeit not completely randomly: I compute it from another dataset data as follows:

data=.... #(sized (10,10000)
for i in 1:n_iterations 
    e = rand(10) # A new one is picked at each iteration 
    D = data'e # now a vector of size 10000
    runtime_test!(rez,slack,D)
    do_something_with!(rez)
end

data is fixed from one iteration to the other, but e is not (and therefore neither D, slack and rez). rez is needed for the rest of the computations, but D and slack are not.

Topic		Replies	Views
Could this kernel go faster? Performance	3	826	January 4, 2022
Why are calls to this function slow, compared to Python+Numpy? General Usage fast-math	14	2168	August 13, 2017
Why is this Julia code considerably slower than Matlab New to Julia performance	64	8624	March 5, 2017
Need help on improving performance of a small subfunction Performance	21	966	May 4, 2020
Boltzmann Factor in a loop Performance	38	1906	April 3, 2018

Is this the maximum perf i can obtain?

Related topics