Lend a hand on the Hessian of ordinal logistic regression

Nosferican · December 19, 2018, 10:42pm

I am trying to port the ordinal logistic regression to a Fisher Scoring implementation. I need to validate the objective function, gradient, and hessian for these purposes. I am struggling with the Hessian and would appreciate some help.

Here is the problem,

using Distributions, FillArrays, LinearAlgebra, Optim, RDatasets, StatsBase,
      StatsModels
# Ordinal Logistic Regression
data = dataset("Ecdat", "Kakadu")[[:RecParks, :Sex, :Age, :Schooling, :Income]]
data.RecParks = categorical(data.RecParks, ordered = true)
formula = @formula(RecParks ~ Sex + Age + Schooling + Income)
terms = StatsModels.Terms(formula)
terms.intercept = false
mf = ModelFrame(terms, data)
mm = ModelMatrix(mf)
X = mm.m
y = model_response(mf)
wts = FrequencyWeights(Ones(y))
l = levels(y)
y = get.(Ref(Dict(zip(l, eachindex(l)))), y, nothing)
@views function f(β)
    η = X * β[1:size(X, 2)]
    thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
    -sum(wᵢ * log(cdf(Logistic(), thresholds[yᵢ + 1] - ηᵢ) -
                  cdf(Logistic(), thresholds[yᵢ] - ηᵢ))
         for (yᵢ, ηᵢ, wᵢ) in zip(y, η, wts))
end
@views function g!(G, β)
    η = X * β[1:size(X, 2)]
    thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
    Y₀ = thresholds[y] .- η
    Y₁ = thresholds[y .+ 1] .- η
    F₀ = cdf.(Logistic(), Y₀)
    F₁ = cdf.(Logistic(), Y₁)
    f₀ = replace!(pdf.(Logistic(), Y₀), NaN => 0)
    f₁ = pdf.(Logistic(), Y₁)
    G[1:size(X, 2)] = -(X' * ((f₀ - f₁) ./ (F₁ - F₀)))
    G[size(X, 2) + 1:end] =
        map(x -> -sum((f₁ .* (y .== x) - f₀ .* (y .== (x + 1))) ./ (F₁ - F₀)),
            1:length(β) - size(X, 2))
    G
end
β = append!(zeros(size(X, 2)), range(0, 1, length = length(l) - 1))
td₀ = TwiceDifferentiable(f, g!, β)
β = optimize(td₀, β) |> Optim.minimizer
H = Optim.hessian!(td₀, β)
# Implementation of Hessian
η = X * β[1:size(X, 2)]
thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
Y₀ = thresholds[y] .- η
Y₁ = thresholds[y .+ 1] .- η
F₀ = cdf.(Logistic(), Y₀)
F₁ = cdf.(Logistic(), Y₁)
f₀ = replace!(pdf.(Logistic(), Y₀), NaN => 0)
f₁ = pdf.(Logistic(), Y₁)
# Up to here same as gradient which is correct
H₀ = zero(H)
b1 = H[1:4,1:4] # Replace with ∂β/∂β
b2 = H[1:4,5:end] # Replace with ∂β/∂α
b3 = H[5:end,5:end] # Replace with ∂α/∂α
H₀[1:4,1:4] = b1
H₀[1:4,5:end] = b2
H₀[5:end,5:end] = b3
H₀ = Hermitian(H₀)
H ≈ H₀ # this should be true

The Hessian is described in McKelvey and Zavoina (1975).

dpsanders · December 19, 2018, 11:20pm

Can’t you just use automatic differentiation?

Nosferican · December 19, 2018, 11:24pm

The idea is to have it part of the Fisher Scoring implementation which uses O’Leary QR variant. I could for the moment just make everything in Iterative Re-weighting Least Squares and plug in the Hessian from auto-diff, but that is sub-optimal… Especially since the IRLS should handle some cases more efficiently. I am using the auto-diff currently to verify my implementations (gradient is validated and I know the correct Hessian). The issue is getting there. I have tried to use the same components I know are correct (since I validated the gradient) to build the Hessian from the notes, but have been unsuccessful.

Nosferican · March 18, 2019, 4:46am

Bump. I am still looking for help on this. I would be incredibly grateful for help with this. Log-likelihood, gradient, and Hessian are in

McKelvey, Richard D., and William Zavoina. 1975. “A Statistical Model for the Analysis of Ordinal Level Dependent Variables.” The Journal of Mathematical Sociology 4 (1): 103–20. DOI:10.1080/0022250X.1975.9989847).

mcabbott · March 18, 2019, 1:40pm

I misunderstood and was trying to teach Mathematica to calculate these… before I saw (3.8), and realised the question is the next step!

If it were my problem I would try pretty hard to type those formulas in as written, in index notation. Einsum.jl actually has some limited ability to deal with shifts Φ[k+1] but strangely not Φ[k-1]… otherwise you could make a Φ and shifted Φm etc beforehand:

julia> using Einsum

julia> aa = float(collect(1:3))
3-element Array{Float64,1}:
 1.0
 2.0
 3.0

julia> @einsum mm[i,j] := aa[i] / aa[j+1]
3×2 Array{Float64,2}:
 0.5  0.333333
 1.0  0.666667
 1.5  1.0     

julia> @einsum mm[i,j] := aa[i] + aa[j-1]
ERROR: BoundsError: attempt to access 3-element Array{Float64,1} at index [0]

Nosferican · March 18, 2019, 1:47pm

I have tried for instance just verifying the first block ∂ℓℓ∂β which only requires the indicator, pdf, cdf, and features, and even then I have had issues comparing it to the implied Hessian taken as the inverse of the variance covariance estimates from software I am comparing to. Been trying to work block by block as a way to verify understanding of the notation, but have failed so many attempts… If it helps I can post the for loop version for each block and the known answer.

mcabbott · March 18, 2019, 2:58pm

Yes one block at a time sounds good. But if the issue is whether it’s typed in correctly (and written correctly in the paper) I would compare first perhaps to the ForwardDiff.gradient of one component of the gradient?

Nosferican · March 18, 2019, 3:10pm

The other place I found the log-likelihood, gradient, and hessian is

from Maddala, Gangadharrao S. 2008. Limited-Dependent and Qualitative Variables in Econometrics . 1983. Econometric Society Monographs 3. I will post the code I have for the gradient in a bit and the comparison with the auto-diff one.

Nosferican · March 18, 2019, 9:02pm

I will be using the cost function for now and letting auto-diff sort out the gradient and hessian for the time being. I will check up on this again in case someone figures how to code the analytical versions.

mcabbott · March 18, 2019, 9:13pm

I don’t promise to get to this, but could you perhaps include the simplest version of what shape everything is? X, y ,l, etc. In order to play around without installing a dozen packages.

Nosferican · March 19, 2019, 12:11am

using Distributions: cdf, Logistic, pdf
using Optim: gradient, gradient!, hessian, hessian!, minimizer, optimize, TwiceDifferentiable
using Random: seed!
using Test: @testset, @test

seed!(0)
y = rand(1:4, 1000)
X = hcat(rand(1000), rand(0:1, 1000))
wts = ones(1000)

function f(β)
  cost = zero(Float64)
  η = X * β[1:size(X, 2)]
  ζ = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
  for i ∈ eachindex(η)
    k = y[i]
    Y₀ = clamp(ζ[k] - η[i], -100, 100)
    Y₁ = clamp(ζ[k + 1] - η[i], -100, 100)
    F₀ = cdf(Logistic(), Y₀)
    F₁ = cdf(Logistic(), Y₁)
    cost -= wts[i] * log(F₁ - F₀)
  end
  cost
end

β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
td = TwiceDifferentiable(f, β₀)
β = optimize(td, β₀) |> minimizer
Ψ = inv(hessian!(td, β))

@testset "ℓℓ β̂ V̂ with auto-diff" begin
  @test -f(β) ≈ -1385.4955
  @test β ≈ [0.1082102, 0.1113961, -1.00791, 0.1073175, 1.238957] atol = 1e-6
  @test Ψ ≈ [0.03795771	0.00077293	0.01937706	0.01944459	0.01948606
             0.00077293	0.01285767	0.00641446	0.00649767	0.00661532
             0.01937706	0.00641446	0.01809343	0.01544379	0.01461900
             0.01944459	0.00649767	0.01544379	0.01686302	0.01558289
             0.01948606	0.00661532	0.01461900	0.01558289	0.01843883
             ] atol = 1e-4
end

# Implement g!, h!, and fg! ∋
# function g!(G, β)
#   # Equations (3.7)
# 
#   # ∂ℓℓ∂β
# 
#   # ∂ℓℓ∂α
# 
# end
# function h!(H, β)
#   # Equations (3.8)
# 
#   # ∂ℓℓ∂β∂β
# 
#   # ∂ℓℓ∂α∂β
# 
#   # ∂ℓℓ∂α∂α
# 
# end
# function fg!(G, β)
#   # Updates ∇ and returns the cost @ β
# 
# end

# β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
# td = TwiceDifferentiable(f, g!, h!, fg!, β₀)
# β = optimize(td, β₀) |> minimizer
# Ψ = inv(Optim.hessian!(td, β))
# 
# @testset "ℓℓ β̂ V̂ with analytical ∇ and Hessian" begin
#   @test -f(β) ≈ -1385.4955
#   @test β ≈ [0.1082102, 0.1113961, -1.00791, 0.1073175, 1.238957] atol = 1e-6
#   @test Ψ ≈ [0.03795771	0.00077293	0.01937706	0.01944459	0.01948606
#              0.00077293	0.01285767	0.00641446	0.00649767	0.00661532
#              0.01937706	0.00641446	0.01809343	0.01544379	0.01461900
#              0.01944459	0.00649767	0.01544379	0.01686302	0.01558289
#              0.01948606	0.00661532	0.01461900	0.01558289	0.01843883
#              ] atol = 1e-4
# end

Nosferican · March 19, 2019, 12:48am

This is a similar case, but one where auto-diff fails for the information matrix.

using Distributions: cdf, Logistic, pdf
using Optim: gradient, gradient!, hessian, hessian!, minimizer, optimize, TwiceDifferentiable
using Random: seed!
using Test: @testset, @test, @test_broken

seed!(0)
y = rand(1:4, 100)
X = hcat(rand(100), rand(0:1, 100))
wts = ones(100)

function f(β)
  cost = zero(Float64)
  η = X * β[1:size(X, 2)]
  ζ = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
  for i ∈ eachindex(η)
    k = y[i]
    Y₀ = clamp(ζ[k] - η[i], -100, 100)
    Y₁ = clamp(ζ[k + 1] - η[i], -100, 100)
    F₀ = cdf(Logistic(), Y₀)
    F₁ = cdf(Logistic(), Y₁)
    cost -= wts[i] * log(F₁ - F₀)
  end
  cost
end

β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
td = TwiceDifferentiable(f, β₀)
β = optimize(td, β₀) |> minimizer
Ψ = inv(hessian!(td, β))

@testset "ℓℓ β̂ V̂" begin
  @test -f(β) ≈ -138.19529 atol = 1e-5
  @test β ≈ [-0.0752212, -0.0187382, -1.200432, -0.1276449, 0.8482401] atol = 1e-6
  @test_broken Ψ ≈ [0.44063026 0.02664232 0.23493455 0.23390820 0.23162855
                    0.02664232 0.13062293 0.08411162 0.08390074 0.08278606
                    0.23493455 0.08411162 0.21796978 0.18992484 0.17929260
                    0.23390820 0.08390074 0.18992484 0.20195565 0.18722638
                    0.23162855 0.08278606 0.17929260 0.18722638 0.20700211] atol = 1e-4
end

bbellweather · May 16, 2020, 1:49pm

It appears that I am late to the party. I was wondering if you have by any chance resolved that issue with Hessian matrix? I have the exact same problem (only difference is that I’ve been programming ordinal logistic regression from scratch in R). My question is how we can compute blocks of Hessian matrix as the formulas (3.8) give us a vector as a result ?

Nosferican · May 16, 2020, 2:47pm

I did get an updated source that has an alternative definition but haven’t had the time to test it out. I believe Topics in ordinal logistic regression and its applications - CORE Reader (from 2014 rather than 1975 so it is more legible) pages 18-19 have one that seems correct. It is in my TODO list but please don’t let that stop anyone from beating me to it.

Topic		Replies	Views
Inconsistency between standard errors from Hessian and Gradient in ForwardDiff/Optim Optimization (Mathematical) optim	6	2491	October 26, 2018
Highly Asymmetric Hessian after optimization Optimization (Mathematical)	26	1684	July 6, 2019
Hessian matrix for standard errors using Optim.jl example Statistics question	1	2964	November 27, 2018
Improving performance of a number of functions for large scale gradient based optimization Performance performance , optim , memory-allocation , loops	11	874	November 5, 2021
Optim.jl - get the hessian matrix from maximum likelihood estimation Optimization (Mathematical) question	6	1288	August 3, 2022

Lend a hand on the Hessian of ordinal logistic regression

Related topics