Lend a hand on the Hessian of ordinal logistic regression

I am trying to port the ordinal logistic regression to a Fisher Scoring implementation. I need to validate the objective function, gradient, and hessian for these purposes. I am struggling with the Hessian and would appreciate some help.

Here is the problem,

using Distributions, FillArrays, LinearAlgebra, Optim, RDatasets, StatsBase,
      StatsModels
# Ordinal Logistic Regression
data = dataset("Ecdat", "Kakadu")[[:RecParks, :Sex, :Age, :Schooling, :Income]]
data.RecParks = categorical(data.RecParks, ordered = true)
formula = @formula(RecParks ~ Sex + Age + Schooling + Income)
terms = StatsModels.Terms(formula)
terms.intercept = false
mf = ModelFrame(terms, data)
mm = ModelMatrix(mf)
X = mm.m
y = model_response(mf)
wts = FrequencyWeights(Ones(y))
l = levels(y)
y = get.(Ref(Dict(zip(l, eachindex(l)))), y, nothing)
@views function f(β)
    η = X * β[1:size(X, 2)]
    thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
    -sum(wᵢ * log(cdf(Logistic(), thresholds[yᵢ + 1] - ηᵢ) -
                  cdf(Logistic(), thresholds[yᵢ] - ηᵢ))
         for (yᵢ, ηᵢ, wᵢ) in zip(y, η, wts))
end
@views function g!(G, β)
    η = X * β[1:size(X, 2)]
    thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
    Y₀ = thresholds[y] .- η
    Y₁ = thresholds[y .+ 1] .- η
    F₀ = cdf.(Logistic(), Y₀)
    F₁ = cdf.(Logistic(), Y₁)
    f₀ = replace!(pdf.(Logistic(), Y₀), NaN => 0)
    f₁ = pdf.(Logistic(), Y₁)
    G[1:size(X, 2)] = -(X' * ((f₀ - f₁) ./ (F₁ - F₀)))
    G[size(X, 2) + 1:end] =
        map(x -> -sum((f₁ .* (y .== x) - f₀ .* (y .== (x + 1))) ./ (F₁ - F₀)),
            1:length(β) - size(X, 2))
    G
end
β = append!(zeros(size(X, 2)), range(0, 1, length = length(l) - 1))
td₀ = TwiceDifferentiable(f, g!, β)
β = optimize(td₀, β) |> Optim.minimizer
H = Optim.hessian!(td₀, β)
# Implementation of Hessian
η = X * β[1:size(X, 2)]
thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
Y₀ = thresholds[y] .- η
Y₁ = thresholds[y .+ 1] .- η
F₀ = cdf.(Logistic(), Y₀)
F₁ = cdf.(Logistic(), Y₁)
f₀ = replace!(pdf.(Logistic(), Y₀), NaN => 0)
f₁ = pdf.(Logistic(), Y₁)
# Up to here same as gradient which is correct
H₀ = zero(H)
b1 = H[1:4,1:4] # Replace with ∂β/∂β
b2 = H[1:4,5:end] # Replace with ∂β/∂α
b3 = H[5:end,5:end] # Replace with ∂α/∂α
H₀[1:4,1:4] = b1
H₀[1:4,5:end] = b2
H₀[5:end,5:end] = b3
H₀ = Hermitian(H₀)
H ≈ H₀ # this should be true

The Hessian is described in McKelvey and Zavoina (1975).

1 Like

Can’t you just use automatic differentiation?

The idea is to have it part of the Fisher Scoring implementation which uses O’Leary QR variant. I could for the moment just make everything in Iterative Re-weighting Least Squares and plug in the Hessian from auto-diff, but that is sub-optimal… Especially since the IRLS should handle some cases more efficiently. I am using the auto-diff currently to verify my implementations (gradient is validated and I know the correct Hessian). The issue is getting there. I have tried to use the same components I know are correct (since I validated the gradient) to build the Hessian from the notes, but have been unsuccessful.

Bump. I am still looking for help on this. I would be incredibly grateful for help with this. Log-likelihood, gradient, and Hessian are in


McKelvey, Richard D., and William Zavoina. 1975. “A Statistical Model for the Analysis of Ordinal Level Dependent Variables.” The Journal of Mathematical Sociology 4 (1): 103–20. DOI:10.1080/0022250X.1975.9989847).

I misunderstood and was trying to teach Mathematica to calculate these… before I saw (3.8), and realised the question is the next step!

If it were my problem I would try pretty hard to type those formulas in as written, in index notation. Einsum.jl actually has some limited ability to deal with shifts Φ[k+1] but strangely not Φ[k-1]… otherwise you could make a Φ and shifted Φm etc beforehand:

julia> using Einsum

julia> aa = float(collect(1:3))
3-element Array{Float64,1}:
 1.0
 2.0
 3.0

julia> @einsum mm[i,j] := aa[i] / aa[j+1]
3×2 Array{Float64,2}:
 0.5  0.333333
 1.0  0.666667
 1.5  1.0     

julia> @einsum mm[i,j] := aa[i] + aa[j-1]
ERROR: BoundsError: attempt to access 3-element Array{Float64,1} at index [0]

I have tried for instance just verifying the first block ∂ℓℓ∂β which only requires the indicator, pdf, cdf, and features, and even then I have had issues comparing it to the implied Hessian taken as the inverse of the variance covariance estimates from software I am comparing to. Been trying to work block by block as a way to verify understanding of the notation, but have failed so many attempts… If it helps I can post the for loop version for each block and the known answer.

Yes one block at a time sounds good. But if the issue is whether it’s typed in correctly (and written correctly in the paper) I would compare first perhaps to the ForwardDiff.gradient of one component of the gradient?

The other place I found the log-likelihood, gradient, and hessian is

from Maddala, Gangadharrao S. 2008. Limited-Dependent and Qualitative Variables in Econometrics . 1983. Econometric Society Monographs 3. I will post the code I have for the gradient in a bit and the comparison with the auto-diff one.

I will be using the cost function for now and letting auto-diff sort out the gradient and hessian for the time being. I will check up on this again in case someone figures how to code the analytical versions.

I don’t promise to get to this, but could you perhaps include the simplest version of what shape everything is? X, y ,l, etc. In order to play around without installing a dozen packages.

using Distributions: cdf, Logistic, pdf
using Optim: gradient, gradient!, hessian, hessian!, minimizer, optimize, TwiceDifferentiable
using Random: seed!
using Test: @testset, @test

seed!(0)
y = rand(1:4, 1000)
X = hcat(rand(1000), rand(0:1, 1000))
wts = ones(1000)

function f(β)
  cost = zero(Float64)
  η = X * β[1:size(X, 2)]
  ζ = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
  for i ∈ eachindex(η)
    k = y[i]
    Y₀ = clamp(ζ[k] - η[i], -100, 100)
    Y₁ = clamp(ζ[k + 1] - η[i], -100, 100)
    F₀ = cdf(Logistic(), Y₀)
    F₁ = cdf(Logistic(), Y₁)
    cost -= wts[i] * log(F₁ - F₀)
  end
  cost
end

β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
td = TwiceDifferentiable(f, β₀)
β = optimize(td, β₀) |> minimizer
Ψ = inv(hessian!(td, β))

@testset "ℓℓ β̂ V̂ with auto-diff" begin
  @test -f(β) ≈ -1385.4955
  @test β ≈ [0.1082102, 0.1113961, -1.00791, 0.1073175, 1.238957] atol = 1e-6
  @test Ψ ≈ [0.03795771	0.00077293	0.01937706	0.01944459	0.01948606
             0.00077293	0.01285767	0.00641446	0.00649767	0.00661532
             0.01937706	0.00641446	0.01809343	0.01544379	0.01461900
             0.01944459	0.00649767	0.01544379	0.01686302	0.01558289
             0.01948606	0.00661532	0.01461900	0.01558289	0.01843883
             ] atol = 1e-4
end

# Implement g!, h!, and fg! ∋
# function g!(G, β)
#   # Equations (3.7)
# 
#   # ∂ℓℓ∂β
# 
#   # ∂ℓℓ∂α
# 
# end
# function h!(H, β)
#   # Equations (3.8)
# 
#   # ∂ℓℓ∂β∂β
# 
#   # ∂ℓℓ∂α∂β
# 
#   # ∂ℓℓ∂α∂α
# 
# end
# function fg!(G, β)
#   # Updates ∇ and returns the cost @ β
# 
# end

# β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
# td = TwiceDifferentiable(f, g!, h!, fg!, β₀)
# β = optimize(td, β₀) |> minimizer
# Ψ = inv(Optim.hessian!(td, β))
# 
# @testset "ℓℓ β̂ V̂ with analytical ∇ and Hessian" begin
#   @test -f(β) ≈ -1385.4955
#   @test β ≈ [0.1082102, 0.1113961, -1.00791, 0.1073175, 1.238957] atol = 1e-6
#   @test Ψ ≈ [0.03795771	0.00077293	0.01937706	0.01944459	0.01948606
#              0.00077293	0.01285767	0.00641446	0.00649767	0.00661532
#              0.01937706	0.00641446	0.01809343	0.01544379	0.01461900
#              0.01944459	0.00649767	0.01544379	0.01686302	0.01558289
#              0.01948606	0.00661532	0.01461900	0.01558289	0.01843883
#              ] atol = 1e-4
# end

This is a similar case, but one where auto-diff fails for the information matrix.

using Distributions: cdf, Logistic, pdf
using Optim: gradient, gradient!, hessian, hessian!, minimizer, optimize, TwiceDifferentiable
using Random: seed!
using Test: @testset, @test, @test_broken

seed!(0)
y = rand(1:4, 100)
X = hcat(rand(100), rand(0:1, 100))
wts = ones(100)

function f(β)
  cost = zero(Float64)
  η = X * β[1:size(X, 2)]
  ζ = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
  for i ∈ eachindex(η)
    k = y[i]
    Y₀ = clamp(ζ[k] - η[i], -100, 100)
    Y₁ = clamp(ζ[k + 1] - η[i], -100, 100)
    F₀ = cdf(Logistic(), Y₀)
    F₁ = cdf(Logistic(), Y₁)
    cost -= wts[i] * log(F₁ - F₀)
  end
  cost
end

β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
td = TwiceDifferentiable(f, β₀)
β = optimize(td, β₀) |> minimizer
Ψ = inv(hessian!(td, β))

@testset "ℓℓ β̂ V̂" begin
  @test -f(β) ≈ -138.19529 atol = 1e-5
  @test β ≈ [-0.0752212, -0.0187382, -1.200432, -0.1276449, 0.8482401] atol = 1e-6
  @test_broken Ψ ≈ [0.44063026 0.02664232 0.23493455 0.23390820 0.23162855
                    0.02664232 0.13062293 0.08411162 0.08390074 0.08278606
                    0.23493455 0.08411162 0.21796978 0.18992484 0.17929260
                    0.23390820 0.08390074 0.18992484 0.20195565 0.18722638
                    0.23162855 0.08278606 0.17929260 0.18722638 0.20700211] atol = 1e-4
end

It appears that I am late to the party. I was wondering if you have by any chance resolved that issue with Hessian matrix? I have the exact same problem (only difference is that I’ve been programming ordinal logistic regression from scratch in R). My question is how we can compute blocks of Hessian matrix as the formulas (3.8) give us a vector as a result ?

I did get an updated source that has an alternative definition but haven’t had the time to test it out. I believe https://core.ac.uk/reader/4268580 (from 2014 rather than 1975 so it is more legible) pages 18-19 have one that seems correct. It is in my TODO list but please don’t let that stop anyone from beating me to it.

1 Like