Performance issue with this bootstrap

Hi,

I have the following piece of code that performs a simple bootstrap, but it is taking a very long time. I was wondering if I made obvious mistakes in the implementation, although @code_warntype seems to tell me everything’s alright on the type-stability side. Here’s the MWE:

using Distributions, Random, Statistics, Test, BenchmarkTools

function f(M,D::Vector{T},t_star,m::Integer) where T
    n = length(D)
    mu_mat = D .^(0:(m+1))' .* exp.(t_star .* D)
    κ = zeros(T,(M,m+1))
    μ = zeros(T,(M,m+2))
    facts = factorial.(big.(0:m-1))
    bins = binomial.(big.(0:m-1),big.(0:m-1)')
    # this first loop has independent iterations: it's the bootstrap loop and can be reordered as you want.
    @inbounds for i in 1:M
        μ[i,:] =  mean(mu_mat[sample(1:n,n,replace=true),:],dims=1)
        μ[i,2:end] ./= μ[i,1]
        κ[i,1:2] .= log(μ[i,1]),μ[i,2]
        for k in 3:(m+1) # this loop cannot be reordered.
            κ[i,k] = μ[i,k]
            for j in 1:(k-2)
                κ[i,k] -= bins[k-1,j]*κ[i,j+1]*μ[i,k-j]
            end
        end
        κ[i,2:end] ./= facts
    end
    return κ
end

Random.seed!(123)
D = BigFloat.(rand(LogNormal(),10000));
M = 10 # should be 10000
@btime f(M,D,-1,20)

Do you see something obvious that hurts my performance?
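
Note: with BenchmarkTools it is usually recommended to interpolate non-constant globals with $ so that the benchmark measures f itself rather than access to untyped globals; probably a small effect here since the BigFloat arithmetic dominates, but the hygienic call would be:

@btime f($M, $D, -1, 20)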


Edit: I posted a better version of the MWE, making the first comment from @Oscar_Smith irrelevant, sorry.

This has a bunch of inefficiencies. One that is pretty easy to fix is that BellPartialMatrix could just be

function Bell_partial_matrix(N, K, x)
    # M[n+1, k+1] holds the partial Bell polynomial B_{n,k}(x_1, x_2, ...)
    M = zeros(eltype(x .+ 1.0), (N + 1, K + 1))
    M[1, 1] = 1   # B_{0,0} = 1; B_{n,0} = 0 for n ≥ 1 is already zero
    for n in 1:N
        for k in 1:min(n, K)   # k ≤ K keeps the column index in bounds
            # a generator avoids allocating a temporary array inside sum
            M[n+1, k+1] = sum(binomial(n - 1, i - 1) * x[i] * M[n-i+1, k] for i in 1:(n-k+1))
        end
    end
    return M
end

Thanks for the catch, indeed.

However, this is not the part of the code that actually runs; it is only there for testing purposes, so I do not care much about it. Only the resemps_cum_from_data and cum_from_mom_rec! functions are of interest.


EDIT: I uploaded a better version of the MWE, so the previous comment from @Oscar_Smith is now irrelevant, sorry.

How does the new version perform? It looks like it should be a lot better.

It does perform 10 times better, but it is still very slow and allocates a lot.

You can get another 60x speedup and remove almost all of the memory allocation by using Float64 instead of BigFloat. However, doing this while maintaining accuracy requires being smarter about how you’re computing things. Two big things to get you started: loggamma and logabsbinomial from SpecialFunctions.jl give you an accurate log(factorial(x)) (as loggamma(x + 1)) and log(binomial(n, k)). To make full use of these you will have to rethink a lot of your algorithm to work in the log domain, but it should help performance a ton.
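
A minimal sketch of those two building blocks (illustrative names, assuming only that SpecialFunctions.jl is available):

using SpecialFunctions

# log(n!) in pure Float64, no overflow: log(factorial(n)) == loggamma(n + 1)
logfact(n) = loggamma(n + 1.0)

# log(abs(binomial(n, k))) together with the sign of the binomial, also in Float64
logbin, sgn = logabsbinomial(40, 17)

# sanity checks against exact big-integer arithmetic
logfact(20) ≈ Float64(log(factorial(big(20))))         # true
logbin     ≈ Float64(log(binomial(big(40), big(17))))  # true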


Take a look at the values of μ and κ in my example and you’ll see why this is impossible to do. Changing the algorithm is going to be Very Hard, and making it stable enough to work in Float64 is clearly not possible… Indeed, that’s a pity 🙂

Looking at this a tiny bit closer, I’ve found that you can go from cubic to quadratic by removing the innermost loop.

function f(M,D::Vector{T},t_star,m::Integer) where T
    n = length(D)
    mu_mat = D .^(0:(m+1))' .* exp.(t_star .* D)
    κ = zeros(T,(M,m+1))
    μ = zeros(T,(M,m+2))
    facts = factorial.(big.(0:m-1))
    bins = binomial.(big.(0:m-1),big.(0:m-1)')
    # this first loop has independent iterations: it's the bootstrap loop and can be reordered as you want.
    @inbounds for i in 1:M
        μ[i,:] =  mean(mu_mat[sample(1:n,n,replace=true),:],dims=1)
        μ[i,2:end] ./= μ[i,1]
        κ[i,1:2] .= log(μ[i,1]),μ[i,2]
        for k in 3:(m+1) # this loop cannot be reordered.
            κ[i,k] = μ[i,k] - μ[i,k-1] - bins[k-1,k-1]*κ[i,k-1]*μ[i,2] + κ[i,k-1]
        end
        κ[i,2:end] ./= facts
    end
    return κ
end

Hmm, this clearly does not give the same result. I called yours f2 and ran:

Random.seed!(123)
D = BigFloat.(rand(LogNormal(),10000));
M = 10 # should be 10000
@btime f(M,D,-1,20)
@btime f2(M,D,-1,20)
Random.seed!(123);
x1 = f(M,D,-1,20);
Random.seed!(123);
x2 = f2(M,D,-1,20);
print(x1 .- x2)

And obtained:

julia> x1 .- x2
10×21 Matrix{BigFloat}:
 0.0  0.0  0.0  -0.131225  -0.199557  -0.195749  …  -0.0207983  -0.0176833  -0.0149706  -0.0125858
 0.0  0.0  0.0  -0.128825  -0.198647  -0.197519     -0.0174794  -0.0144286  -0.0119604  -0.00994301      
 0.0  0.0  0.0  -0.130347  -0.20012   -0.198427     -0.0196492  -0.0164244  -0.0137693  -0.0115894       
 0.0  0.0  0.0  -0.131826  -0.200053  -0.196572     -0.0181401  -0.0144528  -0.011456   -0.00907264      
 0.0  0.0  0.0  -0.125221  -0.192597  -0.190085     -0.0177587  -0.0151358  -0.0129843  -0.0112031       
 0.0  0.0  0.0  -0.133855  -0.204578  -0.202867  …  -0.0167944  -0.0136078  -0.01112    -0.00916617      
 0.0  0.0  0.0  -0.129762  -0.198722  -0.197835     -0.0176968  -0.0144141  -0.0118796  -0.00993722      
 0.0  0.0  0.0  -0.130312  -0.199037  -0.196306     -0.0184909  -0.0151889  -0.0125069  -0.0103279       
 0.0  0.0  0.0  -0.137049  -0.207016  -0.203595     -0.0176319  -0.0146333  -0.0121903  -0.0101957       
 0.0  0.0  0.0  -0.130036  -0.200019  -0.199322     -0.019396   -0.0158364  -0.0128925  -0.0104616

It turns out all the binomials and factorials were not needed at all; the short derivation below shows why.
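
(A reconstruction of the algebra, with notation matching the code below: define η_k = μ_k/(k-1)! and τ_k = κ_k/(k-2)!. The moment-to-cumulant recursion used in f is

$$\kappa_k = \mu_k - \sum_{j=1}^{k-2} \binom{k-2}{j-1}\,\kappa_{j+1}\,\mu_{k-j},$$

and substituting the rescaled quantities makes the binomial cancel the factorials:

$$(k-2)!\,\tau_k = (k-1)!\,\eta_k - \sum_{j=1}^{k-2} \frac{(k-2)!}{(j-1)!\,(k-j-1)!}\,(j-1)!\,\tau_{j+1}\,(k-j-1)!\,\eta_{k-j}
\;\Longrightarrow\;
\tau_k = (k-1)\,\eta_k - \sum_{j=1}^{k-2} \tau_{j+1}\,\eta_{k-j},$$

which is exactly the loop in τ_from_η!.)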
I separated the kernel functions from the rest of the code, and now I have the following:

using Statistics, StatsBase

function ηs_from_data!(ηs,D,t_star)
    # Column i+1 holds D.^i .* exp.(t_star .* D) ./ factorial(i),
    # built incrementally so no explicit powers or factorials are computed.
    ηs[:,1] = exp.(t_star .* D)
    m = size(ηs,2)
    for i=1:(m-1)
        ηs[:,i+1] = ηs[:,i] .* D ./ i
    end
end
function τ_from_η!(τ,η)
    # Recursion from the rescaled moments η to the τ coefficients;
    # the 1/factorial rescaling built into η makes the binomial weights vanish.
    η[2:end] ./= η[1]
    τ[1] = log(η[1])
    τ[2] = η[2]
    for k in 3:length(η) # this loop cannot be reordered
        τ[k] = (k-1)*η[k]
        for j in 1:(k-2)
            τ[k] -= τ[j+1] * η[k-j]
        end
    end
    return τ
end
function thorin_moments(D::Vector{T},t_star,m) where T
    n = length(D)
    τ = zeros(T,m+1)
    ηs = zeros(T,(n,m+1))
    ηs_from_data!(ηs,D,t_star)
    # η is created directly by dropdims, so no separate zeros(T,m+1) is needed
    η = dropdims(Statistics.mean(ηs,dims=1),dims=1)
    τ_from_η!(τ,η)
    return τ
end
function resemps_thorin_moments(M,D::AbstractMatrix{T},t_star,m) where {T}
    d,n = size(D)
    ηs = zeros(T,(d,n,m+1))
    τ = zeros(T,(M,d,m+1))
    η = zeros(T,(M,d,m+1))
    # precompute the rescaled moments once per dimension
    for dim in 1:d
        @views ηs_from_data!(ηs[dim,:,:],D[dim,:],t_star)
    end
    # bootstrap loop: iterations are independent, hence threaded
    Threads.@threads for i in 1:M
        println(i) # progress output; remove for benchmarking
        η[i,:,:] = dropdims(Statistics.mean(ηs[:,StatsBase.sample(1:n,n,replace=true),:],dims=2),dims=2)
        for dim in 1:d
           @views τ_from_η!(τ[i,dim,:],η[i,dim,:])
        end
    end
    return τ
end
function resemps_thorin_moments(M,D::Vector{T},t_star,m) where T
    return resemps_thorin_moments(M,reshape(D,(1,length(D))),t_star,m)[:,1,:]
end
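
One possible allocation reduction in the bootstrap loop above (a sketch; resample_mean! is a hypothetical helper, not part of the thread): ηs[:, idx, :] materializes a full d×n×(m+1) copy on every replicate before mean reduces it, whereas accumulating the mean in place over the sampled columns needs no large temporary:

# Hypothetical helper: mean of ηs over resampled columns idx,
# accumulated directly into the d×(m+1) slice η_i.
function resample_mean!(η_i, ηs, idx)
    fill!(η_i, 0)
    for j in idx
        @views η_i .+= ηs[:, j, :]
    end
    η_i ./= length(idx)
    return η_i
end

# in the bootstrap loop, instead of the mean/dropdims line:
#   idx = StatsBase.sample(1:n, n, replace=true)
#   @views resample_mean!(η[i,:,:], ηs, idx)

With BigFloat each scalar addition still allocates, but the d×n×(m+1) temporary per replicate disappears.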


# Simple test case
N,M,m = 1000,100,20 # My goal is 10000,10000,40
D = randn(N) .^2
D2 = D .+ (randn(N).^2)/3
D2 = BigFloat.([D D2])'

rez0 = thorin_moments(D,-1,20) # should work
rez1 = resemps_thorin_moments(M,D,-1,20) # should work
rez2 = resemps_thorin_moments(M,D2,-1,20) # should be as fast as possible!

Could you give me your opinion on the code? It seems to be type-stable; I tried Traceur.@trace but couldn’t understand the results.

The profiler seems to suggest that the majority of the computation happens in Statistics.mean. Is that reasonable?

Well… at first glance I would say no, that is not where the majority of the time should be.

However, since we are dealing with BigFloats, maybe Statistics.mean falls back to generic code that allocates a lot.
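
One way to check that hypothesis (a hypothetical micro-benchmark, not from the thread): every BigFloat operation allocates a fresh BigFloat, so even a plain mean over a BigFloat vector allocates on the order of n times, while the Float64 version allocates nothing:

using BenchmarkTools, Statistics

v64  = rand(10_000)
vbig = BigFloat.(v64)

@btime mean($v64)   # 0 allocations: tight Float64 summation
@btime mean($vbig)  # ~one allocation per addition: each BigFloat result is a new object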