How to speed up the calculation of Logit likelihood?

Hi, I am currently using Julia to solve some econometric structural models, which usually involve calculating and optimizing a log-likelihood function. While the optimization can be handled by Optim.jl, the log-likelihood itself has to be hand-coded.

My question is how to calculate the log-likelihood as fast as possible, especially when the input is unbalanced panel data. The logit implementation below has inputs X and y in which each individual has a different number of observations, so I store X as a Vector{Matrix{Float64}} and y as a Vector{Vector{Int64}}.

Is nll_threading below an efficient way to calculate the log-likelihood? Are there other techniques for handling such problems? (PS: I have already applied several tricks such as multi-threading and views to speed it up; however, it is still slow, especially when the per-individual log-likelihood has a more complicated form, such as a hidden Markov model or a Bayesian learning model.)

using LogExpFunctions, Optim
using Random, Distributions
using LinearAlgebra:dot
using BenchmarkTools
const seed = Random.seed!(2022 - 1 - 3) # seed the global RNG (the expression evaluates to 2018) and keep a handle to it
# data generation
begin
    Inds = 1000 # number of individuals
    nX = 4
    Tlen = rand([15, 10, 20], Inds) # number of observations of each individual
    X = Vector{Matrix{Float64}}()
    for i in eachindex(Tlen)
        push!(X, randn(seed, (Tlen[i], nX)))
    end
    beta = [1.0, 2, -3, 4]
    y = Vector{Vector{Int64}}()
    for i in 1:Inds
        Xi = X[i]
        yi = Vector{Int64}()
        for t in axes(Xi, 1)
            U = @views dot(Xi[t, :], beta)
            push!(yi, rand(BernoulliLogit(-U)))
        end
        push!(y, yi)
    end
end
# negative log-likelihood function (to optimize)
function nll_threading(X, beta::Vector{T}, y) where {T <: Real}
    # one partial sum per thread to avoid a data race on a shared accumulator
    ind_ll = zeros(T, Threads.nthreads())
    Threads.@threads for i in eachindex(y)
        Xi = X[i]
        yi = y[i]
        for t in axes(Xi, 1)
            U = @views dot(Xi[t, :], beta)  # linear index for observation t, no copy of the row
            ind_ll[Threads.threadid()] += logpdf(BernoulliLogit(-U), yi[t])
        end
    end
    return -sum(ind_ll)
end
# benchmark
@benchmark nll_threading($X, $beta, $y)
# about 380 ms for 6 threads on an Intel i5-10400F chip

# optimization
fit_optim = optimize(
    pars -> nll_threading(X, pars, y),
    zeros(4),
    Newton();
    autodiff = :forward
)
fit_optim.minimizer

Benchmark for the current version

julia> @btime nll_threading($X, $beta, $y)
  551.400 μs (111 allocations: 12.39 KiB)

I think it might help to make X a Matrix{SVector{4}} and beta an SVector{4}. That way the dot product knows ahead of time that it is working with length-4 vectors. Another likely 2x speedup is switching to Float32, which should have enough accuracy while using half the memory.
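A minimal sketch of that idea, assuming each individual's covariates are stored as a Vector of SVector{4,Float64} rows instead of a Matrix (the names Xs, beta_s, and nll_threading_static are mine, just for illustration):

using StaticArrays, Distributions, LinearAlgebra

# convert each individual's design matrix into static length-4 rows,
# and beta into an SVector, so the dot product is unrolled at compile time
Xs = [[SVector{4,Float64}(Xi[t, :]) for t in axes(Xi, 1)] for Xi in X]
beta_s = SVector{4,Float64}(beta)

function nll_threading_static(Xs, beta, y)
    ind_ll = zeros(eltype(beta), Threads.nthreads())
    Threads.@threads for i in eachindex(y)
        Xi = Xs[i]
        yi = y[i]
        for t in eachindex(Xi)
            U = dot(Xi[t], beta)  # length-4 dot product, no view allocation
            ind_ll[Threads.threadid()] += logpdf(BernoulliLogit(-U), yi[t])
        end
    end
    return -sum(ind_ll)
end

nll_threading_static(Xs, beta_s, y)

For the Optim call with autodiff = :forward, beta can stay a plain Vector (of dual numbers); only the data need to be static. The Float32 variant would use SVector{4,Float32} in the conversion above.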


Another question 🙂: I also face another weird problem. If I run this on my local machine with 6 threads (i5-10400F, 6 physical cores), CPU usage is high, over 80%. However, if I run it on a remote Windows server with 18 threads (i9-10980XE, 18 physical cores), CPU usage is relatively low, around 40%. It seems Julia didn't utilize all the threads I asked for, even though I set the number of threads to 18 when starting the REPL.
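(For reference, a quick way to check how many threads a session actually picked up; the --threads flag and the JULIA_NUM_THREADS environment variable are the standard ways to request them:)

# start Julia with e.g. `julia --threads 18`, or set JULIA_NUM_THREADS=18
# before launching, then verify inside the session:
Threads.nthreads()  # should return 18 if the setting took effect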

Thanks, that seems helpful. I will post the benchmark results after implementing your ideas.

For some problems you can avoid redundant likelihood evaluations by grouping identical observations, e.g.:

using Distributions, StatsBase  # countmap comes from StatsBase

D = Poisson(100)
x = rand(D, 1000)
xc = countmap(x)  # frequency table of the distinct values

sum(logpdf(D, xi) * n for (xi, n) in xc)  # 56 iterations instead of 1000 in this draw

But I guess this won't help in your case, unless perhaps you round U off to a few decimals.

Perhaps putting Threads.@threads before the inner loop,

for t in axes(Xi, 1)

might be better.

Might be worse.

# Two implementations: one uses a vector to store each observation's likelihood,
# the other uses a lock to avoid a data race under multi-threading.
# Both perform worse than the original version.
function nll_threading_inner1(X, beta::Vector{T}, y) where {T <: Real}
    ll = zero(T)
    @inbounds for i in eachindex(y)
        Xi = X[i]
        yi = y[i] 
        lli = zeros(T, length(yi))  # fresh temporary allocated for every individual
        Threads.@threads for t in axes(Xi, 1)  # a new set of tasks is spawned for each i
            U = @views dot(Xi[t, :], beta)
            lli[t] = logpdf(BernoulliLogit(-U), yi[t])
        end
        ll += sum(lli)
    end
    return - ll
end

function nll_threading_inner2(X, beta::Vector{T}, y) where {T <: Real}
    ll = zero(T)
    l = ReentrantLock()
    @inbounds for i in eachindex(y)
        Xi = X[i]
        yi = y[i] 
        Threads.@threads for t in axes(Xi, 1)
            U = @views dot(Xi[t, :], beta)
            lock(l) do
                # the lock serializes these updates, so threads mostly wait on each other
                ll += logpdf(BernoulliLogit(-U), yi[t])
            end
        end
    end
    return - ll
end

Benchmark results

julia> @btime nll_threading_inner1($X, $beta, $y)
  20.123 ms (107235 allocations: 11.96 MiB)
julia> @btime nll_threading_inner2($X, $beta, $y)
  22.684 ms (137368 allocations: 12.28 MiB)

When dealing with matrices, you always want to iterate over the fastest-varying index rather than the slower ones to increase memory locality. Julia arrays are column-major, so the first index is the fast one. Instead of

@views dot(Xi[t, :], beta)

you would want something like

@views dot(Xi[:, t], beta)

This requires transposing the generation process as well, from

randn(seed, (Tlen[i], nX))

to

randn(seed, (nX, Tlen[i]))

This change gives me a substantial improvement on my local machine.
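For concreteness, a minimal sketch of the column-major variant; here the existing X is transposed with permutedims so the simulated y still lines up, and Xcol / nll_threading_col are illustrative names:

# store each individual's data as nX × T so that observation t is a contiguous column
Xcol = [permutedims(Xi) for Xi in X]

function nll_threading_col(X, beta::Vector{T}, y) where {T <: Real}
    ind_ll = zeros(T, Threads.nthreads())
    Threads.@threads for i in eachindex(y)
        Xi = X[i]
        yi = y[i]
        for t in axes(Xi, 2)
            # a column view is contiguous in Julia's column-major layout
            U = @views dot(Xi[:, t], beta)
            ind_ll[Threads.threadid()] += logpdf(BernoulliLogit(-U), yi[t])
        end
    end
    return -sum(ind_ll)
end

nll_threading_col(Xcol, beta, y)  # same value as nll_threading(X, beta, y)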


This turned out to be the easiest way to improve it: roughly a 5x speed-up after indexing column by column. Thanks.

julia> @btime nll_threading($X, $beta, $y)
  118.400 μs (109 allocations: 12.33 KiB)