Maximizing an objective function with many logs

Hi,

I have a conceptually simple objective function (an MLE estimator) that has a lot of logs. This kind of objective function is very common in Statistics and Economics.

For example, consider the following objective function:

L(\theta)=\sum_{i=1}^{N} \sum_{j=0}^{J} y_{i j} \cdot \log \left(\sum_{r=1}^{R} \theta_{r} \cdot s_{i j}^{r}\right)

where s_{ij}^r are constants and \theta = (\theta_1,...,\theta_R)^\prime are the parameters. An MWE of the objective function, the gradient, and the Hessian:

using Random, LinearAlgebra, BenchmarkTools, FiniteDiff

# setup
const N = 2000  # observations
const J = 4     # options
const R = 20    # parameters

# constants
const s = [rand(J+1,R)  for i=1:N]      # s[i][j,r]
for i=1:N, r=1:R
    s[i][:,r] /= sum(s[i][:,r])
end

# weights
y = [Float64.(rand(30:60,J+1)) for i=1:N]

# params
θ = rand(R)
θ /= sum(θ)

# objective function
function obj(
    y::Vector{Vector{Float64}},     # N × J+1
    s::Vector{Matrix{Float64}},     # N × J+1 × R
    θ::Vector{Float64}              # R
)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    J = length(y[1])-1
    Nt = Threads.nthreads()
    llk = zeros(Float64, Nt)
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        ss = zeros(Float64, J+1)
        for r=1:R
            ss += absθ[r] .* s[i][:,r]
        end
        @inbounds @simd for j=1:J+1
            llk[tid] += y[i][j]*log(ss[j])
        end
    end
    return sum(llk)/N
end

# gradient
function grad(
    y::Vector{Vector{Float64}},
    s::Vector{Matrix{Float64}},
    θ::Vector{Float64}
)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    J = length(y[1])-1
    Nt = Threads.nthreads()
    dllk = zeros(Float64,R,Nt)
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        ss = zeros(Float64, J+1)
        for r=1:R
            ss += absθ[r] .* s[i][:,r]
        end
        @inbounds @simd for j=1:J+1
            @views dllk[:,tid] .+= y[i][j] .* s[i][j,:] ./ ss[j]
        end
    end
    return sum(dllk,dims=2)[:,1]/N
end

# hessian
function hess(
    y::Vector{Vector{Float64}},
    s::Vector{Matrix{Float64}},
    θ::Vector{Float64}
)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    J = length(y[1])-1
    Nt = Threads.nthreads()
    hllk = zeros(Float64,R,R,Nt)
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        ss = zeros(Float64, J+1)
        for r=1:R
            ss += absθ[r] .* s[i][:,r]
        end
        @inbounds @simd for j=1:J+1
            @views hllk[:,:,tid] .+= - y[i][j] .* (s[i][j,:] * s[i][j,:]') ./ ss[j]^2
        end
    end
    return sum(hllk,dims=3)[:,:,1]/N
end


# testing with numerical derivatives
function obj_wrapper(θ)
    return obj(y,s,θ)
end

test_grad = grad(y,s,θ)
num_grad = FiniteDiff.finite_difference_gradient(obj_wrapper,θ)
test_grad ≈ num_grad

test_hess = hess(y,s,θ)
num_hess = FiniteDiff.finite_difference_hessian(obj_wrapper,θ)
sum(abs2,test_hess .- num_hess)

Scaling up a little.

## scaling up

# setup
const N = 300000  # observations
const J = 4     # options
const R = 40    # parameters

# constants
const s = [rand(J+1,R)  for i=1:N]      # s[i][j,r]
for i=1:N, r=1:R
    s[i][:,r] /= sum(s[i][:,r])
end

y = [Float64.(rand(30:60,J+1)) for i=1:N]

# params
θ = rand(R)
θ /= sum(θ)

using BenchmarkTools
@btime obj($y,$s,$θ)
@btime grad($y,$s,$θ)
@btime hess($y,$s,$θ)
  5.456 s (36300025 allocations: 4.33 GiB)
  5.150 s (36300031 allocations: 4.33 GiB)
  32.944 s (37800033 allocations: 22.39 GiB)

This is a very simple example, but the performance of real MLE applications can get very nasty, and the logs make it worse.

I wonder what packages and tricks one can use in Julia to speed up the evaluation and optimization of functions like this?

Thank you all.

Don’t allocate in inner loops. You are allocating 3R+1 arrays per iteration of the outer loop here.
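
Something like this (a sketch, untested; obj_noalloc is just a placeholder name) accumulates the sum over r into a scalar and indexes s[i] directly, so nothing is allocated inside the loops:

# sketch: same algorithm as obj, but the r-sum goes into a scalar, no temporaries
function obj_noalloc(
    y::Vector{Vector{Float64}},
    s::Vector{Matrix{Float64}},
    θ::Vector{Float64}
)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    J = length(y[1])-1
    llk = zeros(Float64, Threads.nthreads())
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        si, yi = s[i], y[i]
        @inbounds for j = 1:J+1
            ssj = 0.0
            @simd for r = 1:R
                ssj += absθ[r] * si[j,r]    # no slice, no broadcast, no new array
            end
            llk[tid] += yi[j] * log(ssj)
        end
    end
    return sum(llk)/N
end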

Since J is small, you might also consider using a StaticVector (from StaticArrays.jl) for vectors of length J or J+1 (and use arrays of StaticVectors rather than J×R matrices, etc.).
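
For instance (a sketch, not benchmarked; s_sv, y_sv, and obj_static are names I'm making up here), one length-(J+1) SVector per (i, r) column keeps the inner accumulation on the stack:

using StaticArrays

# convert the existing data: a vector of SVectors per observation
s_sv = [[SVector{J+1}(s[i][:,r]) for r = 1:R] for i = 1:N]
y_sv = [SVector{J+1}(y[i]) for i = 1:N]

function obj_static(y, s, θ)
    absθ = abs.(θ)
    N = length(s)
    R = length(θ)
    llk = zeros(Float64, Threads.nthreads())
    Threads.@threads for i = 1:N
        tid = Threads.threadid()
        ss = zero(y[i])                  # an SVector: lives on the stack
        for r = 1:R
            ss += absθ[r] * s[i][r]      # SVector arithmetic, no heap allocation
        end
        llk[tid] += sum(y[i] .* log.(ss))
    end
    return sum(llk)/N
end

obj_static(y_sv, s_sv, θ)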


I think that dealing with big arrays, instead of arrays of arrays, can make this much more efficient.

I’ve followed your formula here, although your code does a few things differently — it has the sum on i inside the log, it takes an absolute value of θ, and it divides by N. These could all be done but you get the idea.

For first derivatives, there is lots of great automatic differentiation technology (and this too likes solid arrays much more than slices). For Hessians, less so.

N = 300000; J = 4; R = 40  # as above
# constants
s = rand(R,J+1,N);      # s[r,j,i]
s ./= sum(s, dims=2);
# weights              # y[j,i]
y = rand(30.0:60.0, J+1, N);

# params
θ = rand(R);
θ ./= sum(θ);  # previous version did not update θ, typo?

using TensorCore # for ⊡, just reshape & *

# objective function
function obj(
    y::AbstractMatrix,    # J+1 × N
    s::AbstractArray,     # R × J+1 × N
    θ::AbstractVector,    # R
)
    θs = θ ⊡ s # contract r, gives θs[j,i]
    sum(y .* log.(θs))
end

@btime obj($y, $s, $θ)  # 36.416 ms (10 allocations: 22.89 MiB)
# was 1.181 s (36300024 allocations: 4.33 GiB) on same computer

using Tullio # , LoopVectorization # this may speed it up further

function obj2(
    y::AbstractMatrix,    # J+1 × N
    s::AbstractArray,     # R × J+1 × N
    θ::AbstractVector,    # R
)
    θs = θ ⊡ s # contract r, gives θs[j,i], uses BLAS which is hard to beat
    @tullio out := y[j,i] * log(θs[j,i])  # multi-threaded, and low-allocations
end

@btime obj2($y, $s, $θ) # 30.720 ms (101 allocations: 11.45 MiB)

using Zygote

@btime gradient(obj, $y, $s, $θ);  # 305.199 ms (4500131 allocations: 640.87 MiB)
@btime gradient(obj2, $y, $s, $θ); # 219.831 ms (243 allocations: 492.11 MiB)
# @btime grad($y,$s,$θ) was 1.150 s (36300033 allocations: 4.33 GiB) on same computer
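
(For the Hessian, one slow-but-simple fallback, not benchmarked here, is nested forward mode over this array version of obj:)

using ForwardDiff
ForwardDiff.hessian(θ -> obj(y, s, θ), θ)  # exact, but expect it to be much slower than a hand-written hess for R = 40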

LoopVectorization should make log much faster, but the contraction takes up almost the entirety of the runtime.
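
(obj3 isn't shown above; it is obj2's body re-evaluated after loading LoopVectorization, so that @tullio can use it. Roughly:)

using LoopVectorization   # loaded before the @tullio below, so Tullio can pick it up

function obj3(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    θs = θ ⊡ s                              # same BLAS contraction over r
    @tullio out := y[j,i] * log(θs[j,i])    # now expanded with LoopVectorization available
end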

julia> @btime obj2($y, $s, $θ) # without LoopVectorization
  10.606 ms (842 allocations: 11.48 MiB)
-1.090240480666746e8

julia> @btime obj3($y, $s, $θ) # same definition, I just `using LoopVectorization` first
  9.876 ms (843 allocations: 11.48 MiB)
-1.0902404806667463e8

julia> @btime $θ ⊡ $s;
  9.371 ms (8 allocations: 11.44 MiB)

OK! I think this one might match the original code, and no longer just a contraction. But maybe I have one more bug for you…

function obj4(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    N = size(y,2)
    @tullio mid[j] := abs(θ[r]) * s[r,j,i]  
    @tullio tot := y[j,i] * log(mid[j]) / N   # here LV can also help re-order the loop?
end
@btime obj4($y, $s, $θ) # 19.257 ms (146 allocations: 6.50 KiB)
# with LV: UndefVarError: ##vptr##_y##MAX##-6## not defined

And:

function hess(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector) # not carefully checked!
    @tullio mid[j] := abs(θ[r]) * s[r,j,i]
    @tullio hllk[μ,ν] := - y[j,i] * s[μ,j,i] * s[ν,j,i] / (size(s,2) * mid[j]^2)
end
@btime hess($y, $s, $θ); # 613.583 ms (104 allocations: 17.88 KiB)
#  with LoopVectorization: 330.075 ms (103 allocations: 17.48 KiB)
# original: 5.642 s (37800034 allocations: 22.39 GiB) on same computer

@mcabbott Tullio makes it so fast and elegant, thank you.

Just one thing: I think there is a typo in obj4. In the original code there is a summation over all r for each j and i separately; the sum over i is supposed to stay outside the log.

I still need to keep the whole array mid in memory; any idea how to avoid that?

function obj5(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    @tullio mid[j,i] := abs(θ[r]) * s[r,j,i]
    @tullio tot := y[j,i] * log(mid[j,i])   # here LV can also help re-order the loop?
end
@btime obj($y, $s, $θ)
@btime obj2($y, $s, $θ)
@btime obj5($y, $s, $θ)
131.714 ms (10 allocations: 22.89 MiB)
116.227 ms (103 allocations: 11.45 MiB)
39.414 ms (147 allocations: 11.45 MiB)

Oh, I may have misread the loops in your first function; sorry, i is outside the log, as you say.

Tullio is limited to one reduction per expression. It can apply a function after the reduction, but this won’t help much here. (It could go further, IIRC Halide lets you chain things lazily, but eventually it becomes a whole language…)

You could try to re-use a smaller array tmp[j] instead of mid[j,i], but this doesn’t seem fast.

Or you can drop down to writing the loops, of course. Maybe this can be made fast. Using @avx here seems to give me a different answer?

function obj6(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    @tullio fin[j,i] := abs(θ[r]) * s[r,j,i] |> y[j,i] * log(_)  # "|> anon" is outside the sum
    sum(fin)
end

function obj7(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    tmp = similar(y, axes(y,1)) .= 0
    out = 0.0
    for i in axes(y,2)
        @tullio tmp[j] = abs(θ[r]) * s[r,j,$i]
        @tullio out += y[j,$i] * log(tmp[j])
    end
    out
end

function obj8(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    out = 0.0
    # @avxt for i in axes(y,2)  # this is faster, but gives a different answer? 
    @inbounds for i in axes(y,2)
        for j in axes(y,1)
            acc = 0.0
            for r in axes(θ,1)
                acc += abs(θ[r]) * s[r,j,i]
            end
            out += y[j,i] * log(acc)  # second reduction
        end
    end
    out
end

@btime obj5($y, $s, $θ)  # 11.560 ms (146 allocations: 11.45 MiB)
@btime obj6($y, $s, $θ)  #  9.272 ms (52 allocations: 11.45 MiB)
@btime obj7($y, $s, $θ)  # 41.004 ms (899490 allocations: 13.73 MiB)
@btime obj8($y, $s, $θ)  # 40.356 ms (0 allocations: 0 bytes)

(In fact @tullio tot := y[j,i] * log(dot(absθ, s[:,j,i])) avx=false seems to work here, but I think this is largely an accident, it’s not tested, and it’s not fast.)
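
Spelled out as a function, with the same caveats (obj9 is just a name for this sketch, and dot comes from LinearAlgebra):

using LinearAlgebra: dot

function obj9(y::AbstractMatrix, s::AbstractArray, θ::AbstractVector)
    absθ = abs.(θ)
    @tullio tot := y[j,i] * log(dot(absθ, s[:,j,i]))  avx=false  # dot reduces over r, @tullio over j and i
end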
