Performance issue with this bootstrap

Hi,

I have the following piece of code that performs a simple bootstrap, but it is taking a very long time. I was wondering if I made obvious mistakes in the implementation, although @code_warntype seems to tell me everything’s alright on the type-stability side. Here’s the MWE:

using Distributions, Random, Statistics, Test, BenchmarkTools

function f(M,D::Vector{T},t_star,m::Integer) where T
    n = length(D)
    mu_mat = D .^(0:(m+1))' .* exp.(t_star .* D)
    κ = zeros(T,(M,m+1))
    μ = zeros(T,(M,m+2))
    facts = factorial.(big.(0:m-1))
    bins = binomial.(big.(0:m-1),big.(0:m-1)')
    # this first loop has independent iterations: it's the bootstrap loop and can be reordered as you want.
    @inbounds for i in 1:M
        μ[i,:] =  mean(mu_mat[sample(1:n,n,replace=true),:],dims=1)
        μ[i,2:end] ./= μ[i,1]
        κ[i,1:2] .= log(μ[i,1]),μ[i,2]
        for k in 3:(m+1) # this loop cannot be reordered.
            κ[i,k] = μ[i,k]
            for j in 1:(k-2)
                κ[i,k] -= bins[k-1,j]*κ[i,j+1]*μ[i,k-j]
            end
        end
        κ[i,2:end] ./= facts
    end
    return κ
end

Random.seed!(123)
D = BigFloat.(rand(LogNormal(),10000));
M = 10 # should be 10000
@btime f(M,D,-1,20)

Do you see something obvious that hurts my performance?
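
Note: with BenchmarkTools it is usually recommended to interpolate non-constant globals with $ so that the benchmark measures f itself rather than access to untyped globals; probably a small effect here since the BigFloat arithmetic dominates, but the hygienic call would be:

@btime f($M, $D, -1, 20)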


Edit: I posted a better version of the MWE, making the first comment from @Oscar_Smith irrelevant, sorry.

This has a bunch of inefficiencies. One that is pretty easy to fix is that BellPartialMatrix could just be

function Bell_partial_matrix(N, K, x)
    # M[n+1, k+1] holds the partial Bell polynomial B_{n,k}(x_1, x_2, ...)
    M = zeros(eltype(x .+ 1.0), (N + 1, K + 1))
    M[1, 1] = 1   # B_{0,0} = 1; B_{n,0} = 0 for n ≥ 1 is already zero
    for n in 1:N
        for k in 1:min(n, K)   # k ≤ K keeps the column index in bounds
            # a generator avoids allocating a temporary array inside sum
            M[n+1, k+1] = sum(binomial(n - 1, i - 1) * x[i] * M[n-i+1, k] for i in 1:(n-k+1))
        end
    end
    return M
end

Thanks for the catch, indeed.

However, this is not the part of the code that actually runs; it is only there for testing purposes, so I do not care much about it. Only the resemps_cum_from_data and cum_from_mom_rec! functions are of interest.


EDIT: I uploaded a better version of the MWE, so the previous comment from @Oscar_Smith is now irrelevant, sorry.

How does the new version perform? It looks like it should be a lot better.

It does perform 10 times better, but it is still very slow and allocates a lot.

You can get another 60x speedup and remove almost all of the memory allocation by using Float64 instead of BigFloat. However, doing this while maintaining accuracy requires being smarter about how you’re computing things. Two big things to get you started: loggamma and logabsbinomial from SpecialFunctions.jl give you an accurate log(factorial(x)) (as loggamma(x + 1)) and log(binomial(n, k)). To make full use of these you will have to rethink a lot of your algorithm to work in the log domain, but it should help performance a ton.
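
A minimal sketch of those two building blocks (illustrative names, assuming only that SpecialFunctions.jl is available):

using SpecialFunctions

# log(n!) in pure Float64, no overflow: log(factorial(n)) == loggamma(n + 1)
logfact(n) = loggamma(n + 1.0)

# log(abs(binomial(n, k))) together with the sign of the binomial, also in Float64
logbin, sgn = logabsbinomial(40, 17)

# sanity checks against exact big-integer arithmetic
logfact(20) ≈ Float64(log(factorial(big(20))))         # true
logbin     ≈ Float64(log(binomial(big(40), big(17))))  # true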


Take a look at the values of μ and κ in my example and you’ll see why this is impossible to do. Changing the algorithm is going to be Very Hard, and making it stable enough to work in Float64 is clearly not possible… Indeed, that’s a pity 🙂

Looking at this a tiny bit closer, I’ve found that you can go from cubic to quadratic by removing the innermost loop.

function f(M,D::Vector{T},t_star,m::Integer) where T
    n = length(D)
    mu_mat = D .^(0:(m+1))' .* exp.(t_star .* D)
    κ = zeros(T,(M,m+1))
    μ = zeros(T,(M,m+2))
    facts = factorial.(big.(0:m-1))
    bins = binomial.(big.(0:m-1),big.(0:m-1)')
    # this first loop has independent iterations: it's the bootstrap loop and can be reordered as you want.
    @inbounds for i in 1:M
        μ[i,:] =  mean(mu_mat[sample(1:n,n,replace=true),:],dims=1)
        μ[i,2:end] ./= μ[i,1]
        κ[i,1:2] .= log(μ[i,1]),μ[i,2]
        for k in 3:(m+1) # this loop cannot be reordered.
            κ[i,k] = μ[i,k] - μ[i,k-1] - bins[k-1,k-1]*κ[i,k-1]*μ[i,2] + κ[i,k-1]
        end
        κ[i,2:end] ./= facts
    end
    return κ
end

Hmm, this clearly does not give the same result. I called yours f2 and ran:

Random.seed!(123)
D = BigFloat.(rand(LogNormal(),10000));
M = 10 # should be 10000
@btime f(M,D,-1,20)
@btime f2(M,D,-1,20)
Random.seed!(123);
x1 = f(M,D,-1,20);
Random.seed!(123);
x2 = f2(M,D,-1,20);
print(x1 .- x2)

And obtained:

julia> x1 .- x2
10×21 Matrix{BigFloat}:
 0.0  0.0  0.0  -0.131225  -0.199557  -0.195749  …  -0.0207983  -0.0176833  -0.0149706  -0.0125858
 0.0  0.0  0.0  -0.128825  -0.198647  -0.197519     -0.0174794  -0.0144286  -0.0119604  -0.00994301      
 0.0  0.0  0.0  -0.130347  -0.20012   -0.198427     -0.0196492  -0.0164244  -0.0137693  -0.0115894       
 0.0  0.0  0.0  -0.131826  -0.200053  -0.196572     -0.0181401  -0.0144528  -0.011456   -0.00907264      
 0.0  0.0  0.0  -0.125221  -0.192597  -0.190085     -0.0177587  -0.0151358  -0.0129843  -0.0112031       
 0.0  0.0  0.0  -0.133855  -0.204578  -0.202867  …  -0.0167944  -0.0136078  -0.01112    -0.00916617      
 0.0  0.0  0.0  -0.129762  -0.198722  -0.197835     -0.0176968  -0.0144141  -0.0118796  -0.00993722      
 0.0  0.0  0.0  -0.130312  -0.199037  -0.196306     -0.0184909  -0.0151889  -0.0125069  -0.0103279       
 0.0  0.0  0.0  -0.137049  -0.207016  -0.203595     -0.0176319  -0.0146333  -0.0121903  -0.0101957       
 0.0  0.0  0.0  -0.130036  -0.200019  -0.199322     -0.019396   -0.0158364  -0.0128925  -0.0104616

It turns out all the binomials and factorials were not needed at all; the short derivation below shows why.
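
(A reconstruction of the algebra, with notation matching the code below: define η_k = μ_k/(k-1)! and τ_k = κ_k/(k-2)!. The moment-to-cumulant recursion used in f is

$$\kappa_k = \mu_k - \sum_{j=1}^{k-2} \binom{k-2}{j-1}\,\kappa_{j+1}\,\mu_{k-j},$$

and substituting the rescaled quantities makes the binomial cancel the factorials:

$$(k-2)!\,\tau_k = (k-1)!\,\eta_k - \sum_{j=1}^{k-2} \frac{(k-2)!}{(j-1)!\,(k-j-1)!}\,(j-1)!\,\tau_{j+1}\,(k-j-1)!\,\eta_{k-j}
\;\Longrightarrow\;
\tau_k = (k-1)\,\eta_k - \sum_{j=1}^{k-2} \tau_{j+1}\,\eta_{k-j},$$

which is exactly the loop in τ_from_η!.)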
I separated the kernel functions from the rest of the code, and now I have the following:

using Statistics, StatsBase

function ηs_from_data!(ηs,D,t_star)
    # Column i+1 holds D.^i .* exp.(t_star .* D) ./ factorial(i),
    # built incrementally so no explicit powers or factorials are computed.
    ηs[:,1] = exp.(t_star .* D)
    m = size(ηs,2)
    for i=1:(m-1)
        ηs[:,i+1] = ηs[:,i] .* D ./ i
    end
end
function τ_from_η!(τ,η)
    # Recursion from the rescaled moments η to the τ coefficients;
    # the 1/factorial rescaling built into η makes the binomial weights vanish.
    η[2:end] ./= η[1]
    τ[1] = log(η[1])
    τ[2] = η[2]
    for k in 3:length(η) # this loop cannot be reordered
        τ[k] = (k-1)*η[k]
        for j in 1:(k-2)
            τ[k] -= τ[j+1] * η[k-j]
        end
    end
    return τ
end
function thorin_moments(D::Vector{T},t_star,m) where T
    n = length(D)
    τ = zeros(T,m+1)
    ηs = zeros(T,(n,m+1))
    ηs_from_data!(ηs,D,t_star)
    # η is created directly by dropdims, so no separate zeros(T,m+1) is needed
    η = dropdims(Statistics.mean(ηs,dims=1),dims=1)
    τ_from_η!(τ,η)
    return τ
end
function resemps_thorin_moments(M,D::AbstractMatrix{T},t_star,m) where {T}
    d,n = size(D)
    ηs = zeros(T,(d,n,m+1))
    τ = zeros(T,(M,d,m+1))
    η = zeros(T,(M,d,m+1))
    # precompute the rescaled moments once per dimension
    for dim in 1:d
        @views ηs_from_data!(ηs[dim,:,:],D[dim,:],t_star)
    end
    # bootstrap loop: iterations are independent, hence threaded
    Threads.@threads for i in 1:M
        println(i) # progress output; remove for benchmarking
        η[i,:,:] = dropdims(Statistics.mean(ηs[:,StatsBase.sample(1:n,n,replace=true),:],dims=2),dims=2)
        for dim in 1:d
           @views τ_from_η!(τ[i,dim,:],η[i,dim,:])
        end
    end
    return τ
end
function resemps_thorin_moments(M,D::Vector{T},t_star,m) where T
    return resemps_thorin_moments(M,reshape(D,(1,length(D))),t_star,m)[:,1,:]
end
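
One possible allocation reduction in the bootstrap loop above (a sketch; resample_mean! is a hypothetical helper, not part of the thread): ηs[:, idx, :] materializes a full d×n×(m+1) copy on every replicate before mean reduces it, whereas accumulating the mean in place over the sampled columns needs no large temporary:

# Hypothetical helper: mean of ηs over resampled columns idx,
# accumulated directly into the d×(m+1) slice η_i.
function resample_mean!(η_i, ηs, idx)
    fill!(η_i, 0)
    for j in idx
        @views η_i .+= ηs[:, j, :]
    end
    η_i ./= length(idx)
    return η_i
end

# in the bootstrap loop, instead of the mean/dropdims line:
#   idx = StatsBase.sample(1:n, n, replace=true)
#   @views resample_mean!(η[i,:,:], ηs, idx)

With BigFloat each scalar addition still allocates, but the d×n×(m+1) temporary per replicate disappears.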


# Simple test case
N,M,m = 1000,100,20 # My goal is 10000,10000,40
D = randn(N) .^2
D2 = D .+ (randn(N).^2)/3
D2 = BigFloat.([D D2])'

rez0 = thorin_moments(D,-1,20) # should work
rez1 = resemps_thorin_moments(M,D,-1,20) # should work
rez2 = resemps_thorin_moments(M,D2,-1,20) # should be as fast as possible!

Could you give me your opinion on the code? It seems to be type-stable; I tried Traceur.@trace but couldn’t understand the results.

The profiler seems to suggest that the majority of the computation happens in Statistics.mean. Is that reasonable?

Well… at first glance I would say no, that is not where the majority of the time should be.

However, since we are dealing with BigFloats, maybe Statistics.mean falls back to generic code that allocates a lot.
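
One way to check that hypothesis (a hypothetical micro-benchmark, not from the thread): every BigFloat operation allocates a fresh BigFloat, so even a plain mean over a BigFloat vector allocates on the order of n times, while the Float64 version allocates nothing:

using BenchmarkTools, Statistics

v64  = rand(10_000)
vbig = BigFloat.(v64)

@btime mean($v64)   # 0 allocations: tight Float64 summation
@btime mean($vbig)  # ~one allocation per addition: each BigFloat result is a new object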