 # Lend a hand on the Hessian of ordinal logistic regression

#1

I am trying to port the ordinal logistic regression to a Fisher Scoring implementation. I need to validate the objective function, gradient, and hessian for these purposes. I am struggling with the Hessian and would appreciate some help.

Here is the problem,

``````using Distributions, FillArrays, LinearAlgebra, Optim, RDatasets, StatsBase,
StatsModels
# Ordinal Logistic Regression
data = dataset("Ecdat", "Kakadu")[[:RecParks, :Sex, :Age, :Schooling, :Income]]
data.RecParks = categorical(data.RecParks, ordered = true)
formula = @formula(RecParks ~ Sex + Age + Schooling + Income)
terms = StatsModels.Terms(formula)
terms.intercept = false
mf = ModelFrame(terms, data)
mm = ModelMatrix(mf)
X = mm.m
y = model_response(mf)
wts = FrequencyWeights(Ones(y))
l = levels(y)
y = get.(Ref(Dict(zip(l, eachindex(l)))), y, nothing)
@views function f(β)
η = X * β[1:size(X, 2)]
thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
-sum(wᵢ * log(cdf(Logistic(), thresholds[yᵢ + 1] - ηᵢ) -
cdf(Logistic(), thresholds[yᵢ] - ηᵢ))
for (yᵢ, ηᵢ, wᵢ) in zip(y, η, wts))
end
@views function g!(G, β)
η = X * β[1:size(X, 2)]
thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
Y₀ = thresholds[y] .- η
Y₁ = thresholds[y .+ 1] .- η
F₀ = cdf.(Logistic(), Y₀)
F₁ = cdf.(Logistic(), Y₁)
f₀ = replace!(pdf.(Logistic(), Y₀), NaN => 0)
f₁ = pdf.(Logistic(), Y₁)
G[1:size(X, 2)] = -(X' * ((f₀ - f₁) ./ (F₁ - F₀)))
G[size(X, 2) + 1:end] =
map(x -> -sum((f₁ .* (y .== x) - f₀ .* (y .== (x + 1))) ./ (F₁ - F₀)),
1:length(β) - size(X, 2))
G
end
β = append!(zeros(size(X, 2)), range(0, 1, length = length(l) - 1))
td₀ = TwiceDifferentiable(f, g!, β)
β = optimize(td₀, β) |> Optim.minimizer
H = Optim.hessian!(td₀, β)
# Implementation of Hessian
η = X * β[1:size(X, 2)]
thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
Y₀ = thresholds[y] .- η
Y₁ = thresholds[y .+ 1] .- η
F₀ = cdf.(Logistic(), Y₀)
F₁ = cdf.(Logistic(), Y₁)
f₀ = replace!(pdf.(Logistic(), Y₀), NaN => 0)
f₁ = pdf.(Logistic(), Y₁)
# Up to here same as gradient which is correct
H₀ = zero(H)
b1 = H[1:4,1:4] # Replace with ∂β/∂β
b2 = H[1:4,5:end] # Replace with ∂β/∂α
b3 = H[5:end,5:end] # Replace with ∂α/∂α
H₀[1:4,1:4] = b1
H₀[1:4,5:end] = b2
H₀[5:end,5:end] = b3
H₀ = Hermitian(H₀)
H ≈ H₀ # this should be true
``````

The Hessian is described in McKelvey and Zavoina (1975).

#2

Can’t you just use automatic differentiation?

#3

The idea is to have it part of the Fisher Scoring implementation which uses O’Leary QR variant. I could for the moment just make everything in Iterative Re-weighting Least Squares and plug in the Hessian from auto-diff, but that is sub-optimal… Especially since the IRLS should handle some cases more efficiently. I am using the auto-diff currently to verify my implementations (gradient is validated and I know the correct Hessian). The issue is getting there. I have tried to use the same components I know are correct (since I validated the gradient) to build the Hessian from the notes, but have been unsuccessful.

#4

Bump. I am still looking for help on this. I would be incredibly grateful for help with this. Log-likelihood, gradient, and Hessian are in

McKelvey, Richard D., and William Zavoina. 1975. “A Statistical Model for the Analysis of Ordinal Level Dependent Variables.” The Journal of Mathematical Sociology 4 (1): 103–20. DOI:10.1080/0022250X.1975.9989847).

#5

I misunderstood and was trying to teach Mathematica to calculate these… before I saw (3.8), and realised the question is the next step!

If it were my problem I would try pretty hard to type those formulas in as written, in index notation. Einsum.jl actually has some limited ability to deal with shifts `Φ[k+1]` but strangely not `Φ[k-1]`… otherwise you could make a `Φ` and shifted `Φm` etc beforehand:

``````julia> using Einsum

julia> aa = float(collect(1:3))
3-element Array{Float64,1}:
1.0
2.0
3.0

julia> @einsum mm[i,j] := aa[i] / aa[j+1]
3×2 Array{Float64,2}:
0.5  0.333333
1.0  0.666667
1.5  1.0

julia> @einsum mm[i,j] := aa[i] + aa[j-1]
ERROR: BoundsError: attempt to access 3-element Array{Float64,1} at index 
``````

#6

I have tried for instance just verifying the first block `∂ℓℓ∂β` which only requires the indicator, pdf, cdf, and features, and even then I have had issues comparing it to the implied Hessian taken as the inverse of the variance covariance estimates from software I am comparing to. Been trying to work block by block as a way to verify understanding of the notation, but have failed so many attempts… If it helps I can post the for loop version for each block and the known answer.

#7

Yes one block at a time sounds good. But if the issue is whether it’s typed in correctly (and written correctly in the paper) I would compare first perhaps to the `ForwardDiff.gradient` of one component of the gradient?

#8

The other place I found the log-likelihood, gradient, and hessian is

from Maddala, Gangadharrao S. 2008. Limited-Dependent and Qualitative Variables in Econometrics . 1983. Econometric Society Monographs 3. I will post the code I have for the gradient in a bit and the comparison with the auto-diff one.

#9

I will be using the cost function for now and letting auto-diff sort out the gradient and hessian for the time being. I will check up on this again in case someone figures how to code the analytical versions.

#10

I don’t promise to get to this, but could you perhaps include the simplest version of what shape everything is? `X`, `y` ,`l`, etc. In order to play around without installing a dozen packages.

#11
``````using Distributions: cdf, Logistic, pdf
using Optim: gradient, gradient!, hessian, hessian!, minimizer, optimize, TwiceDifferentiable
using Random: seed!
using Test: @testset, @test

seed!(0)
y = rand(1:4, 1000)
X = hcat(rand(1000), rand(0:1, 1000))
wts = ones(1000)

function f(β)
cost = zero(Float64)
η = X * β[1:size(X, 2)]
ζ = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
for i ∈ eachindex(η)
k = y[i]
Y₀ = clamp(ζ[k] - η[i], -100, 100)
Y₁ = clamp(ζ[k + 1] - η[i], -100, 100)
F₀ = cdf(Logistic(), Y₀)
F₁ = cdf(Logistic(), Y₁)
cost -= wts[i] * log(F₁ - F₀)
end
cost
end

β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
td = TwiceDifferentiable(f, β₀)
β = optimize(td, β₀) |> minimizer
Ψ = inv(hessian!(td, β))

@testset "ℓℓ β̂ V̂ with auto-diff" begin
@test -f(β) ≈ -1385.4955
@test β ≈ [0.1082102, 0.1113961, -1.00791, 0.1073175, 1.238957] atol = 1e-6
@test Ψ ≈ [0.03795771	0.00077293	0.01937706	0.01944459	0.01948606
0.00077293	0.01285767	0.00641446	0.00649767	0.00661532
0.01937706	0.00641446	0.01809343	0.01544379	0.01461900
0.01944459	0.00649767	0.01544379	0.01686302	0.01558289
0.01948606	0.00661532	0.01461900	0.01558289	0.01843883
] atol = 1e-4
end

# Implement g!, h!, and fg! ∋
# function g!(G, β)
#   # Equations (3.7)
#
#   # ∂ℓℓ∂β
#
#   # ∂ℓℓ∂α
#
# end
# function h!(H, β)
#   # Equations (3.8)
#
#   # ∂ℓℓ∂β∂β
#
#   # ∂ℓℓ∂α∂β
#
#   # ∂ℓℓ∂α∂α
#
# end
# function fg!(G, β)
#   # Updates ∇ and returns the cost @ β
#
# end

# β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
# td = TwiceDifferentiable(f, g!, h!, fg!, β₀)
# β = optimize(td, β₀) |> minimizer
# Ψ = inv(Optim.hessian!(td, β))
#
# @testset "ℓℓ β̂ V̂ with analytical ∇ and Hessian" begin
#   @test -f(β) ≈ -1385.4955
#   @test β ≈ [0.1082102, 0.1113961, -1.00791, 0.1073175, 1.238957] atol = 1e-6
#   @test Ψ ≈ [0.03795771	0.00077293	0.01937706	0.01944459	0.01948606
#              0.00077293	0.01285767	0.00641446	0.00649767	0.00661532
#              0.01937706	0.00641446	0.01809343	0.01544379	0.01461900
#              0.01944459	0.00649767	0.01544379	0.01686302	0.01558289
#              0.01948606	0.00661532	0.01461900	0.01558289	0.01843883
#              ] atol = 1e-4
# end
``````

#12

This is a similar case, but one where auto-diff fails for the information matrix.

``````using Distributions: cdf, Logistic, pdf
using Optim: gradient, gradient!, hessian, hessian!, minimizer, optimize, TwiceDifferentiable
using Random: seed!
using Test: @testset, @test, @test_broken

seed!(0)
y = rand(1:4, 100)
X = hcat(rand(100), rand(0:1, 100))
wts = ones(100)

function f(β)
cost = zero(Float64)
η = X * β[1:size(X, 2)]
ζ = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
for i ∈ eachindex(η)
k = y[i]
Y₀ = clamp(ζ[k] - η[i], -100, 100)
Y₁ = clamp(ζ[k + 1] - η[i], -100, 100)
F₀ = cdf(Logistic(), Y₀)
F₁ = cdf(Logistic(), Y₁)
cost -= wts[i] * log(F₁ - F₀)
end
cost
end

β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
td = TwiceDifferentiable(f, β₀)
β = optimize(td, β₀) |> minimizer
Ψ = inv(hessian!(td, β))

@testset "ℓℓ β̂ V̂" begin
@test -f(β) ≈ -138.19529 atol = 1e-5
@test β ≈ [-0.0752212, -0.0187382, -1.200432, -0.1276449, 0.8482401] atol = 1e-6
@test_broken Ψ ≈ [0.44063026 0.02664232 0.23493455 0.23390820 0.23162855
0.02664232 0.13062293 0.08411162 0.08390074 0.08278606
0.23493455 0.08411162 0.21796978 0.18992484 0.17929260
0.23390820 0.08390074 0.18992484 0.20195565 0.18722638
0.23162855 0.08278606 0.17929260 0.18722638 0.20700211] atol = 1e-4
end
``````