How to speed up the calculation of Logit likelihood?

Hi, I am currently using Julia to solve some econometric structural models, which usually involve calculating and optimizing a log-likelihood function. While the optimization can be handled by Optim.jl, the log-likelihood itself has to be hand-coded.

My question is how to calculate the log-likelihood as fast as possible, especially when the input is unbalanced panel data. The logit implementation below has inputs X and y in which each individual has a different number of observations, so I store X as a Vector{Matrix{Float64}} and y as a Vector{Vector{Int64}}.

Is nll_threading below an efficient way to calculate the log-likelihood? Are there other techniques for handling such problems? (PS: I have already applied several tricks such as multi-threading and views to speed it up; however, it is still slow, especially when the per-individual log-likelihood has a more complicated form, such as a hidden Markov model or a Bayesian learning model.)

using LogExpFunctions, Optim
using Random, Distributions
using LinearAlgebra:dot
using BenchmarkTools
const seed = Random.seed!(2022 - 1 - 3) # seed the global RNG (the expression evaluates to 2018) and keep a handle to it
# data generation
begin
    Inds = 1000 # number of individuals
    nX = 4
    Tlen = rand([15, 10, 20], Inds) # number of observations of each individual
    X = Vector{Matrix{Float64}}()
    for i in eachindex(Tlen)
        push!(X, randn(seed, (Tlen[i], nX)))
    end
    beta = [1.0, 2, -3, 4]
    y = Vector{Vector{Int64}}()
    for i in 1:Inds
        Xi = X[i]
        yi = Vector{Int64}()
        for t in axes(Xi, 1)
            U = @views dot(Xi[t, :], beta)
            push!(yi, rand(BernoulliLogit(-U)))
        end
        push!(y, yi)
    end
end
# negative log-likelihood function (to optimize)
function nll_threading(X, beta::Vector{T}, y) where {T <: Real}
    # one partial sum per thread to avoid a data race on a shared accumulator
    ind_ll = zeros(T, Threads.nthreads())
    Threads.@threads for i in eachindex(y)
        Xi = X[i]
        yi = y[i]
        for t in axes(Xi, 1)
            U = @views dot(Xi[t, :], beta)  # linear index for observation t, no copy of the row
            ind_ll[Threads.threadid()] += logpdf(BernoulliLogit(-U), yi[t])
        end
    end
    return -sum(ind_ll)
end
# benchmark
@benchmark nll_threading($X, $beta, $y)
# about 380 ms for 6 threads on an Intel i5-10400F chip

# optimization
fit_optim = optimize(
    pars -> nll_threading(X, pars, y),
    zeros(4),
    Newton();
    autodiff = :forward
)
fit_optim.minimizer

Benchmark for the current version

julia> @btime nll_threading($X, $beta, $y)
  551.400 μs (111 allocations: 12.39 KiB)

I think it might help to make X a Matrix{SVector{4}} and beta an SVector{4}. That way the dot product knows ahead of time that it is working with length-4 vectors. Another likely 2x speedup is switching to Float32, which should have enough accuracy while using half the memory.
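A minimal sketch of that idea, assuming each individual's covariates are stored as a Vector of SVector{4,Float64} rows instead of a Matrix (the names Xs, beta_s, and nll_threading_static are mine, just for illustration):

using StaticArrays, Distributions, LinearAlgebra

# convert each individual's design matrix into static length-4 rows,
# and beta into an SVector, so the dot product is unrolled at compile time
Xs = [[SVector{4,Float64}(Xi[t, :]) for t in axes(Xi, 1)] for Xi in X]
beta_s = SVector{4,Float64}(beta)

function nll_threading_static(Xs, beta, y)
    ind_ll = zeros(eltype(beta), Threads.nthreads())
    Threads.@threads for i in eachindex(y)
        Xi = Xs[i]
        yi = y[i]
        for t in eachindex(Xi)
            U = dot(Xi[t], beta)  # length-4 dot product, no view allocation
            ind_ll[Threads.threadid()] += logpdf(BernoulliLogit(-U), yi[t])
        end
    end
    return -sum(ind_ll)
end

nll_threading_static(Xs, beta_s, y)

For the Optim call with autodiff = :forward, beta can stay a plain Vector (of dual numbers); only the data need to be static. The Float32 variant would use SVector{4,Float32} in the conversion above.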


Another question 🙂: I also face another weird problem. If I run this on my local machine with 6 threads (i5-10400F, 6 physical cores), CPU usage is high, over 80%. However, if I run it on a remote Windows server with 18 threads (i9-10980XE, 18 physical cores), CPU usage is relatively low, around 40%. It seems Julia didn't utilize all the threads I asked for, even though I set the number of threads to 18 when starting the REPL.
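(For reference, a quick way to check how many threads a session actually picked up; the --threads flag and the JULIA_NUM_THREADS environment variable are the standard ways to request them:)

# start Julia with e.g. `julia --threads 18`, or set JULIA_NUM_THREADS=18
# before launching, then verify inside the session:
Threads.nthreads()  # should return 18 if the setting took effect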

Thanks, that seems helpful. I will post the benchmark results after implementing your ideas.

For some problems you can avoid redundant likelihood evaluations by grouping identical observations, e.g.:

using Distributions, StatsBase  # countmap comes from StatsBase

D = Poisson(100)
x = rand(D, 1000)
xc = countmap(x)  # frequency table of the distinct values

sum(logpdf(D, xi) * n for (xi, n) in xc)  # 56 iterations instead of 1000 in this draw

But I guess this won't help in your case, unless perhaps you round U off to a few decimals.

Perhaps putting Threads.@threads before the inner loop,

for t in axes(Xi, 1)

might be better.

Might be worse.

# Two implementations: one uses a vector to store each observation's likelihood,
# the other uses a lock to avoid a data race under multi-threading.
# Both perform worse than the original version.
function nll_threading_inner1(X, beta::Vector{T}, y) where {T <: Real}
    ll = zero(T)
    @inbounds for i in eachindex(y)
        Xi = X[i]
        yi = y[i] 
        lli = zeros(T, length(yi))  # fresh temporary allocated for every individual
        Threads.@threads for t in axes(Xi, 1)  # a new set of tasks is spawned for each i
            U = @views dot(Xi[t, :], beta)
            lli[t] = logpdf(BernoulliLogit(-U), yi[t])
        end
        ll += sum(lli)
    end
    return - ll
end

function nll_threading_inner2(X, beta::Vector{T}, y) where {T <: Real}
    ll = zero(T)
    l = ReentrantLock()
    @inbounds for i in eachindex(y)
        Xi = X[i]
        yi = y[i] 
        Threads.@threads for t in axes(Xi, 1)
            U = @views dot(Xi[t, :], beta)
            lock(l) do
                # the lock serializes these updates, so threads mostly wait on each other
                ll += logpdf(BernoulliLogit(-U), yi[t])
            end
        end
    end
    return - ll
end

Benchmark results

julia> @btime nll_threading_inner1($X, $beta, $y)
  20.123 ms (107235 allocations: 11.96 MiB)
julia> @btime nll_threading_inner2($X, $beta, $y)
  22.684 ms (137368 allocations: 12.28 MiB)

When dealing with matrices, you always want to iterate over the fastest-varying index rather than the slower ones to increase memory locality. Julia arrays are column-major, so the first index is the fast one. Instead of

@views dot(Xi[t, :], beta)

you would want something like

@views dot(Xi[:, t], beta)

This requires transposing the generation process as well, from

randn(seed, (Tlen[i], nX))

to

randn(seed, (nX, Tlen[i]))

This change gives me a substantial improvement on my local machine.
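For concreteness, a minimal sketch of the column-major variant; here the existing X is transposed with permutedims so the simulated y still lines up, and Xcol / nll_threading_col are illustrative names:

# store each individual's data as nX × T so that observation t is a contiguous column
Xcol = [permutedims(Xi) for Xi in X]

function nll_threading_col(X, beta::Vector{T}, y) where {T <: Real}
    ind_ll = zeros(T, Threads.nthreads())
    Threads.@threads for i in eachindex(y)
        Xi = X[i]
        yi = y[i]
        for t in axes(Xi, 2)
            # a column view is contiguous in Julia's column-major layout
            U = @views dot(Xi[:, t], beta)
            ind_ll[Threads.threadid()] += logpdf(BernoulliLogit(-U), yi[t])
        end
    end
    return -sum(ind_ll)
end

nll_threading_col(Xcol, beta, y)  # same value as nll_threading(X, beta, y)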


This turned out to be the easiest way to improve it: roughly a 5x speed-up after indexing column by column. Thanks.

julia> @btime nll_threading($X, $beta, $y)
  118.400 μs (109 allocations: 12.33 KiB)