Maximizing an objective function with many logs

Hi,

I have a conceptually simple objective function (an MLE estimator) that has a lot of logs. This kind of objective function is very common in Statistics and Economics.

For example, consider the following objective function:

L(\theta)=\sum_{i=1}^{N} \sum_{j=0}^{J} y_{i j} \cdot \log \left(\sum_{r=1}^{R} \theta_{r} \cdot s_{i j}^{r}\right)

where s_{ij}^r are constants and \theta = (\theta_1,...,\theta_R)^\prime are the parameters. An MWE of the objective function, the gradient, and the Hessian:

using Random, LinearAlgebra, BenchmarkTools, FiniteDiff

# setup
const N = 2000  # observations
const J = 4     # options
const R = 20    # parameters

# constants
const s = [rand(J+1,R)  for i=1:N]      # s[i][j,r]
for i=1:N, r=1:R
    s[i][:,r] /= sum(s[i][:,r])
end

# weights
y = [Float64.(rand(30:60,J+1)) for i=1:N]

# params
θ = rand(R)
θ /= sum(θ)

# objective function
function obj(
    y::Vector{Vector{Float64}},     # N × J+1
    s::Vector{Matrix{Float64}},     # N × J+1 × R
    θ::Vector{Float64}              # R
)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    J = length(y[1])-1
    Nt = Threads.nthreads()
    llk = zeros(Float64, Nt)
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        ss = zeros(Float64, J+1)
        for r=1:R
            ss += absθ[r] .* s[i][:,r]
        end
        @inbounds @simd for j=1:J+1
            llk[tid] += y[i][j]*log(ss[j])
        end
    end
    return sum(llk)/N
end

# gradient
function grad(
    y::Vector{Vector{Float64}},
    s::Vector{Matrix{Float64}},
    θ::Vector{Float64}
)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    J = length(y[1])-1
    Nt = Threads.nthreads()
    dllk = zeros(Float64,R,Nt)
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        ss = zeros(Float64, J+1)
        for r=1:R
            ss += absθ[r] .* s[i][:,r]
        end
        @inbounds @simd for j=1:J+1
            @views dllk[:,tid] .+= y[i][j] .* s[i][j,:] ./ ss[j]
        end
    end
    return sum(dllk,dims=2)[:,1]/N
end

# hessian
function hess(
    y::Vector{Vector{Float64}},
    s::Vector{Matrix{Float64}},
    θ::Vector{Float64}
)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    J = length(y[1])-1
    Nt = Threads.nthreads()
    hllk = zeros(Float64,R,R,Nt)
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        ss = zeros(Float64, J+1)
        for r=1:R
            ss += absθ[r] .* s[i][:,r]
        end
        @inbounds @simd for j=1:J+1
            @views hllk[:,:,tid] .+= - y[i][j] .* (s[i][j,:] * s[i][j,:]') ./ ss[j]^2
        end
    end
    return sum(hllk,dims=3)[:,:,1]/N
end


# testing with numerical derivatives
function obj_wrapper(θ)
    return obj(y,s,θ)
end

test_grad = grad(y,s,θ)
num_grad = FiniteDiff.finite_difference_gradient(obj_wrapper,θ)
test_grad ≈ num_grad

test_hess = hess(y,s,θ)
num_hess = FiniteDiff.finite_difference_hessian(obj_wrapper,θ)
sum(abs2,test_hess .- num_hess)

Scaling up a little.

## scaling up

# setup
const N = 300000  # observations
const J = 4     # options
const R = 40    # parameters

# constants
const s = [rand(J+1,R)  for i=1:N]      # s[i][j,r]
for i=1:N, r=1:R
    s[i][:,r] /= sum(s[i][:,r])
end

y = [Float64.(rand(30:60,J+1)) for i=1:N]

# params
θ = rand(R)
θ /= sum(θ)

using BenchmarkTools
@btime obj($y,$s,$θ)
@btime grad($y,$s,$θ)
@btime hess($y,$s,$θ)
  5.456 s (36300025 allocations: 4.33 GiB)
  5.150 s (36300031 allocations: 4.33 GiB)
  32.944 s (37800033 allocations: 22.39 GiB)

This is a very simple example, but the performance of real MLE applications can get very nasty, and the logs make it worse.

I wonder what packages and tricks one can use in Julia to speed up the evaluation and optimization of functions like this?

Thank you all.

Don’t allocate in inner loops. You are allocating 3R+1 arrays per iteration of the outer loop here.
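
Something like this (a sketch, untested; obj_noalloc is just a placeholder name) accumulates the sum over r into a scalar and indexes s[i] directly, so nothing is allocated inside the loops:

# sketch: same algorithm as obj, but the r-sum goes into a scalar, no temporaries
function obj_noalloc(
    y::Vector{Vector{Float64}},
    s::Vector{Matrix{Float64}},
    θ::Vector{Float64}
)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    J = length(y[1])-1
    llk = zeros(Float64, Threads.nthreads())
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        si, yi = s[i], y[i]
        @inbounds for j = 1:J+1
            ssj = 0.0
            @simd for r = 1:R
                ssj += absθ[r] * si[j,r]    # no slice, no broadcast, no new array
            end
            llk[tid] += yi[j] * log(ssj)
        end
    end
    return sum(llk)/N
end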

Since J is small, you might also consider using a StaticVector (from StaticArrays.jl) for vectors of length J or J+1 (and use arrays of StaticVectors rather than J×R matrices, etc.).
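
For instance (a sketch, not benchmarked; s_sv, y_sv, and obj_static are names I'm making up here), one length-(J+1) SVector per (i, r) column keeps the inner accumulation on the stack:

using StaticArrays

# convert the existing data: a vector of SVectors per observation
s_sv = [[SVector{J+1}(s[i][:,r]) for r = 1:R] for i = 1:N]
y_sv = [SVector{J+1}(y[i]) for i = 1:N]

function obj_static(y, s, θ)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    llk = zeros(Float64, Threads.nthreads())
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        ss = zero(y[i])                  # an SVector: lives on the stack
        for r = 1:R
            ss += absθ[r] * s[i][r]      # SVector arithmetic, no heap allocation
        end
        llk[tid] += sum(y[i] .* log.(ss))
    end
    return sum(llk)/N
end

obj_static(y_sv, s_sv, θ)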


I think that dealing with big arrays, instead of arrays of arrays, can make this much more efficient.

I’ve followed your formula here, although your code does a few things differently — it has the sum on i inside the log, it takes an absolute value of θ, and it divides by N. These could all be done but you get the idea.

For first derivatives, there is lots of great automatic differentiation technology (and this too likes solid arrays much more than slices). For Hessians, less so.

N = 300000; J = 4; R = 40  # as above
# constants
s = rand(R,J+1,N);      # s[r,j,i]
s ./= sum(s, dims=2);
# weights              # y[j,i]
y = rand(30.0:60.0, J+1, N);

# params
θ = rand(R);
θ ./= sum(θ);  # previous version did not update θ, typo?

using TensorCore # for ⊡, just reshape & *

# objective function
function obj(
    y::AbstractMatrix,    # J+1 × N
    s::AbstractArray,     # R × J+1 × N
    θ::AbstractVector,    # R
)
    θs = θ ⊡ s # contract r, gives θs[j,i]
    sum(y .* log.(θs))
end

@btime obj($y, $s, $θ)  # 36.416 ms (10 allocations: 22.89 MiB)
# was 1.181 s (36300024 allocations: 4.33 GiB) on same computer

using Tullio # , LoopVectorization # this may speed it up further

function obj2(
    y::AbstractMatrix,    # J+1 × N
    s::AbstractArray,     # R × J+1 × N
    θ::AbstractVector,    # R
)
    θs = θ ⊡ s # contract r, gives θs[j,i], uses BLAS which is hard to beat
    @tullio out := y[j,i] * log(θs[j,i])  # multi-threaded, and low-allocations
end

@btime obj2($y, $s, $θ) # 30.720 ms (101 allocations: 11.45 MiB)

using Zygote

@btime gradient(obj, $y, $s, $θ);  # 305.199 ms (4500131 allocations: 640.87 MiB)
@btime gradient(obj2, $y, $s, $θ); # 219.831 ms (243 allocations: 492.11 MiB)
# @btime grad($y,$s,$θ) was 1.150 s (36300033 allocations: 4.33 GiB) on same computer
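
(For the Hessian, one slow-but-simple fallback, not benchmarked here, is nested forward mode over this array version of obj:)

using ForwardDiff
ForwardDiff.hessian(θ -> obj(y, s, θ), θ)  # exact, but expect it to be much slower than a hand-written hess for R = 40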

LoopVectorization should make log much faster, but the contraction takes up almost the entirety of the runtime.
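
(obj3 isn't shown above; it is obj2's body re-evaluated after loading LoopVectorization, so that @tullio can use it. Roughly:)

using LoopVectorization   # loaded before the @tullio below, so Tullio can pick it up

function obj3(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    θs = θ ⊡ s                              # same BLAS contraction over r
    @tullio out := y[j,i] * log(θs[j,i])    # now expanded with LoopVectorization available
end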

julia> @btime obj2($y, $s, $θ) # without LoopVectorization
  10.606 ms (842 allocations: 11.48 MiB)
-1.090240480666746e8

julia> @btime obj3($y, $s, $θ) # same definition, I just `using LoopVectorization` first
  9.876 ms (843 allocations: 11.48 MiB)
-1.0902404806667463e8

julia> @btime $θ ⊡ $s;
  9.371 ms (8 allocations: 11.44 MiB)

OK! I think this one might match the original code, and no longer just a contraction. But maybe I have one more bug for you…

function obj4(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    N = size(y,2)
    @tullio mid[j] := abs(θ[r]) * s[r,j,i]  
    @tullio tot := y[j,i] * log(mid[j]) / N   # here LV can also help re-order the loop?
end
@btime obj4($y, $s, $θ) # 19.257 ms (146 allocations: 6.50 KiB)
# with LV: UndefVarError: ##vptr##_y##MAX##-6## not defined

And:

function hess(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector) # not carefully checked!
    @tullio mid[j] := abs(θ[r]) * s[r,j,i]
    @tullio hllk[μ,ν] := - y[j,i] * s[μ,j,i] * s[ν,j,i] / (size(s,2) * mid[j]^2)
end
@btime hess($y, $s, $θ); # 613.583 ms (104 allocations: 17.88 KiB)
#  with LoopVectorization: 330.075 ms (103 allocations: 17.48 KiB)
# original: 5.642 s (37800034 allocations: 22.39 GiB) on same computer

@mcabbott Tullio makes it so fast and elegant, thank you.

Just one thing: I think there is a typo in obj4. In the original code there is a summation over all r for each j and i separately; the sum over i is supposed to stay outside the log.

I still need to keep the whole array mid in memory; any idea how to avoid that?

function obj5(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    @tullio mid[j,i] := abs(θ[r]) * s[r,j,i]
    @tullio tot := y[j,i] * log(mid[j,i])   # here LV can also help re-order the loop?
end
@btime obj($y, $s, $θ)
@btime obj2($y, $s, $θ)
@btime obj5($y, $s, $θ)
131.714 ms (10 allocations: 22.89 MiB)
116.227 ms (103 allocations: 11.45 MiB)
39.414 ms (147 allocations: 11.45 MiB)

Oh, I may have misread the loops in your first function; sorry, i is outside the log, as you say.

Tullio is limited to one reduction per expression. It can apply a function after the reduction, but this won’t help much here. (It could go further, IIRC Halide lets you chain things lazily, but eventually it becomes a whole language…)

You could try to re-use a smaller array tmp[j] instead of mid[j,i], but this doesn’t seem fast.

Or you can drop down to writing the loops, of course. Maybe this can be made fast. Using @avx here seems to give me a different answer?

function obj6(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    @tullio fin[j,i] := abs(θ[r]) * s[r,j,i] |> y[j,i] * log(_)  # "|> anon" is outside the sum
    sum(fin)
end

function obj7(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    tmp = similar(y, axes(y,1)) .= 0
    out = 0.0
    for i in axes(y,2)
        @tullio tmp[j] = abs(θ[r]) * s[r,j,$i]
        @tullio out += y[j,$i] * log(tmp[j])
    end
    out
end

function obj8(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    out = 0.0
    # @avxt for i in axes(y,2)  # this is faster, but gives a different answer? 
    @inbounds for i in axes(y,2)
        for j in axes(y,1)
            acc = 0.0
            for r in axes(θ,1)
                acc += abs(θ[r]) * s[r,j,i]
            end
            out += y[j,i] * log(acc)  # second reduction
        end
    end
    out
end

@btime obj5($y, $s, $θ)  # 11.560 ms (146 allocations: 11.45 MiB)
@btime obj6($y, $s, $θ)  #  9.272 ms (52 allocations: 11.45 MiB)
@btime obj7($y, $s, $θ)  # 41.004 ms (899490 allocations: 13.73 MiB)
@btime obj8($y, $s, $θ)  # 40.356 ms (0 allocations: 0 bytes)

(In fact @tullio tot := y[j,i] * log(dot(absθ, s[:,j,i])) avx=false seems to work here, but I think this is largely an accident, it’s not tested, and it’s not fast.)
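
Spelled out as a function, with the same caveats (obj9 is just a name for this sketch, and dot comes from LinearAlgebra):

using LinearAlgebra: dot

function obj9(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    absθ = abs.(θ)
    @tullio tot := y[j,i] * log(dot(absθ, s[:,j,i]))  avx=false  # dot reduces over r, @tullio over j and i
end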
