Lend a hand on the Hessian of ordinal logistic regression


#1

I am trying to port ordinal logistic regression to a Fisher scoring implementation. For that I need to validate the objective function, gradient, and Hessian, and I am struggling with the Hessian; any help would be appreciated.

Here is the problem:

using Distributions, FillArrays, LinearAlgebra, Optim, RDatasets, StatsBase,
      StatsModels
# Ordinal Logistic Regression
data = dataset("Ecdat", "Kakadu")[[:RecParks, :Sex, :Age, :Schooling, :Income]]
data.RecParks = categorical(data.RecParks, ordered = true)
formula = @formula(RecParks ~ Sex + Age + Schooling + Income)
terms = StatsModels.Terms(formula)
terms.intercept = false
mf = ModelFrame(terms, data)
mm = ModelMatrix(mf)
X = mm.m
y = model_response(mf)
wts = FrequencyWeights(Ones(length(y)))
l = levels(y)
y = get.(Ref(Dict(zip(l, eachindex(l)))), y, nothing)
@views function f(β)
    η = X * β[1:size(X, 2)]
    thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
    -sum(wᵢ * log(cdf(Logistic(), thresholds[yᵢ + 1] - ηᵢ) -
                  cdf(Logistic(), thresholds[yᵢ] - ηᵢ))
         for (yᵢ, ηᵢ, wᵢ) in zip(y, η, wts))
end
@views function g!(G, β)
    η = X * β[1:size(X, 2)]
    thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
    Y₀ = thresholds[y] .- η
    Y₁ = thresholds[y .+ 1] .- η
    F₀ = cdf.(Logistic(), Y₀)
    F₁ = cdf.(Logistic(), Y₁)
    f₀ = replace!(pdf.(Logistic(), Y₀), NaN => 0)
    f₁ = pdf.(Logistic(), Y₁)
    G[1:size(X, 2)] = -(X' * ((f₀ - f₁) ./ (F₁ - F₀)))
    G[size(X, 2) + 1:end] =
        map(x -> -sum((f₁ .* (y .== x) - f₀ .* (y .== (x + 1))) ./ (F₁ - F₀)),
            1:length(β) - size(X, 2))
    G
end
β = append!(zeros(size(X, 2)), range(0, 1, length = length(l) - 1))
td₀ = TwiceDifferentiable(f, g!, β)
β = optimize(td₀, β) |> Optim.minimizer
H = Optim.hessian!(td₀, β)
# Implementation of Hessian
η = X * β[1:size(X, 2)]
thresholds = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
Y₀ = thresholds[y] .- η
Y₁ = thresholds[y .+ 1] .- η
F₀ = cdf.(Logistic(), Y₀)
F₁ = cdf.(Logistic(), Y₁)
f₀ = replace!(pdf.(Logistic(), Y₀), NaN => 0)
f₁ = pdf.(Logistic(), Y₁)
# Up to here, the same quantities as in the (validated) gradient
H₀ = zero(H)
b1 = H[1:4,1:4] # Replace with the ∂²ℓℓ/∂β∂βᵀ block
b2 = H[1:4,5:end] # Replace with the ∂²ℓℓ/∂β∂αᵀ block
b3 = H[5:end,5:end] # Replace with the ∂²ℓℓ/∂α∂αᵀ block
H₀[1:4,1:4] = b1
H₀[1:4,5:end] = b2
H₀[5:end,5:end] = b3
H₀ = Hermitian(H₀)
H ≈ H₀ # this should be true

The Hessian is described in McKelvey and Zavoina (1975).


#2

Can’t you just use automatic differentiation?


#3

The idea is to have it as part of a Fisher scoring implementation that uses O’Leary’s QR variant. For the moment I could do everything with Iteratively Reweighted Least Squares and plug in the Hessian from auto-diff, but that is sub-optimal… especially since the IRLS should handle some cases more efficiently. I am currently using auto-diff to verify my implementations (the gradient is validated, and I know what the correct Hessian should be); the issue is getting there analytically. I have tried to build the Hessian from the notes out of the same components I know are correct (since the gradient validates), but have been unsuccessful.
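
Roughly, the kind of sanity check I run — written here with central finite differences rather than auto-diff so it stands alone — at a non-optimal test point, with f, g!, X, and l from the first post in scope:

βtest = vcat(zeros(size(X, 2)), range(0, 1, length = length(l) - 1))
G = similar(βtest)
g!(G, βtest)                     # analytical gradient
δ = 1e-6                         # central finite differences as a cross-check
Ĝ = [(f(βtest .+ δ .* (eachindex(βtest) .== j)) -
      f(βtest .- δ .* (eachindex(βtest) .== j))) / 2δ for j in eachindex(βtest)]
isapprox(G, Ĝ, rtol = 1e-5)      # should be true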


#4

Bump. I am still looking for help on this and would be incredibly grateful for any. The log-likelihood, gradient, and Hessian are given in:


McKelvey, Richard D., and William Zavoina. 1975. “A Statistical Model for the Analysis of Ordinal Level Dependent Variables.” The Journal of Mathematical Sociology 4 (1): 103–20. DOI: 10.1080/0022250X.1975.9989847.


#5

I misunderstood and was trying to teach Mathematica to calculate these… before I saw (3.8) and realised that the question is about the next step!

If it were my problem I would try pretty hard to type those formulas in as written, in index notation. Einsum.jl actually has some limited ability to deal with shifts like Φ[k+1], but strangely not Φ[k-1]… otherwise you could make a Φ and a shifted Φm etc. beforehand (sketched after the example below):

julia> using Einsum

julia> aa = float(collect(1:3))
3-element Array{Float64,1}:
 1.0
 2.0
 3.0

julia> @einsum mm[i,j] := aa[i] / aa[j+1]
3×2 Array{Float64,2}:
 0.5  0.333333
 1.0  0.666667
 1.5  1.0     

julia> @einsum mm[i,j] := aa[i] + aa[j-1]
ERROR: BoundsError: attempt to access 3-element Array{Float64,1} at index [0]
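
For example, you could pre-compute the shifted copy so that only plain indices appear inside @einsum — the 0.0 below is just a placeholder for whatever the k = 1 boundary value should be (e.g. a cdf of 0):

julia> aam = vcat(0.0, aa[1:end-1])   # aam[j] plays the role of aa[j-1]
3-element Array{Float64,1}:
 0.0
 1.0
 2.0

julia> @einsum mm[i,j] := aa[i] + aam[j]
3×3 Array{Float64,2}:
 1.0  2.0  3.0
 2.0  3.0  4.0
 3.0  4.0  5.0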

#6

I have tried, for instance, just verifying the first block, ∂²ℓℓ/∂β∂βᵀ, which only requires the indicator, pdf, cdf, and features, and even then I have had issues comparing it to the implied Hessian taken as the inverse of the variance-covariance estimates from the software I am comparing against. I have been working block by block as a way to verify my understanding of the notation, but have failed many attempts… If it helps I can post the for-loop version for each block and the known answer.
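
For concreteness, the kind of per-block loop I have in mind for ∂²ℓℓ/∂β∂βᵀ, reusing F₀, F₁, f₀, f₁, and wts from the first post together with the logistic identity f′(z) = f(z)(1 - 2F(z)) — this is exactly the piece I have not managed to validate:

# candidate ∂²(-ℓℓ)/∂β∂βᵀ block, observation by observation
Hββ = zeros(size(X, 2), size(X, 2))
for i in eachindex(y)
    p   = F₁[i] - F₀[i]
    f′₀ = f₀[i] * (1 - 2 * F₀[i])   # derivative of the logistic pdf at the lower cut point
    f′₁ = f₁[i] * (1 - 2 * F₁[i])   # and at the upper cut point
    c   = (f′₀ - f′₁) / p + ((f₁[i] - f₀[i]) / p)^2
    Hββ .+= wts[i] * c .* (X[i, :] * X[i, :]')
end
Hββ ≈ H[1:size(X, 2), 1:size(X, 2)] # what I expect to hold, at least approximately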


#7

Yes, one block at a time sounds good. But if the issue is whether it’s typed in correctly (and written correctly in the paper), I would perhaps first compare each block to the ForwardDiff.gradient of one component of the gradient?
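
Something like this toy check is what I mean — h here is just a stand-in for the objective, and ForwardDiff.hessian stands in for the block you typed from the paper:

using ForwardDiff
h(x) = x[1]^2 * x[2] + exp(x[2])
x0 = [1.0, 2.0]
∇h(x) = ForwardDiff.gradient(h, x)
# row j of the Hessian is the gradient of component j of the gradient
ForwardDiff.gradient(x -> ∇h(x)[1], x0) ≈ ForwardDiff.hessian(h, x0)[1, :]   # true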


#8

The other place I found the log-likelihood, gradient, and Hessian is

Maddala, Gangadharrao S. 2008 [1983]. Limited-Dependent and Qualitative Variables in Econometrics. Econometric Society Monographs 3. I will post the code I have for the gradient shortly, along with the comparison against the auto-diff one.


#9

I will be using the cost function for now and letting auto-diff sort out the gradient and Hessian. I will check back on this in case someone figures out how to code the analytical versions.
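
Concretely, something along these lines, with β₀ the starting values — assuming I read the NLSolversBase autodiff keyword right (the default is finite differencing); note that the ±Inf cut points probably need clamping for ForwardDiff to avoid NaNs:

td = TwiceDifferentiable(f, β₀; autodiff = :forward)   # gradient and Hessian via ForwardDiff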


#10

I don’t promise to get to this, but could you perhaps include the simplest version of what shape everything is (X, y, l, etc.), so one can play around without installing a dozen packages?


#11

using Distributions: cdf, Logistic, pdf
using Optim: gradient, gradient!, hessian, hessian!, minimizer, optimize, TwiceDifferentiable
using Random: seed!
using Test: @testset, @test

seed!(0)
y = rand(1:4, 1000)
X = hcat(rand(1000), rand(0:1, 1000))
wts = ones(1000)

function f(β)
  cost = zero(Float64)
  η = X * β[1:size(X, 2)]
  ζ = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
  for i ∈ eachindex(η)
    k = y[i]
    Y₀ = clamp(ζ[k] - η[i], -100, 100)
    Y₁ = clamp(ζ[k + 1] - η[i], -100, 100)
    F₀ = cdf(Logistic(), Y₀)
    F₁ = cdf(Logistic(), Y₁)
    cost -= wts[i] * log(F₁ - F₀)
  end
  cost
end

β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
td = TwiceDifferentiable(f, β₀)
β = optimize(td, β₀) |> minimizer
Ψ = inv(hessian!(td, β))

@testset "ℓℓ β̂ V̂ with auto-diff" begin
  @test -f(β) ≈ -1385.4955
  @test β ≈ [0.1082102, 0.1113961, -1.00791, 0.1073175, 1.238957] atol = 1e-6
  @test Ψ ≈ [0.03795771	0.00077293	0.01937706	0.01944459	0.01948606
             0.00077293	0.01285767	0.00641446	0.00649767	0.00661532
             0.01937706	0.00641446	0.01809343	0.01544379	0.01461900
             0.01944459	0.00649767	0.01544379	0.01686302	0.01558289
             0.01948606	0.00661532	0.01461900	0.01558289	0.01843883
             ] atol = 1e-4
end

# Implement g!, h!, and fg! such that:
# function g!(G, β)
#   # Equations (3.7)
# 
#   # ∂ℓℓ∂β
# 
#   # ∂ℓℓ∂α
# 
# end
# function h!(H, β)
#   # Equations (3.8)
# 
#   # ∂ℓℓ∂β∂β
# 
#   # ∂ℓℓ∂α∂β
# 
#   # ∂ℓℓ∂α∂α
# 
# end
# function fg!(G, β)
#   # Updates the gradient G in place and returns the cost at β
# 
# end

# β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
# td = TwiceDifferentiable(f, g!, fg!, h!, β₀)
# β = optimize(td, β₀) |> minimizer
# Ψ = inv(Optim.hessian!(td, β))
# 
# @testset "ℓℓ β̂ V̂ with analytical ∇ and Hessian" begin
#   @test -f(β) ≈ -1385.4955
#   @test β ≈ [0.1082102, 0.1113961, -1.00791, 0.1073175, 1.238957] atol = 1e-6
#   @test Ψ ≈ [0.03795771	0.00077293	0.01937706	0.01944459	0.01948606
#              0.00077293	0.01285767	0.00641446	0.00649767	0.00661532
#              0.01937706	0.00641446	0.01809343	0.01544379	0.01461900
#              0.01944459	0.00649767	0.01544379	0.01686302	0.01558289
#              0.01948606	0.00661532	0.01461900	0.01558289	0.01843883
#              ] atol = 1e-4
# end
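
To get things started, here is a sketch of g! in the same loop style as f above, adapted from the vectorized gradient in my first post (Equations (3.7)); I have not validated it against the tests above. The NaN guard on f₀ mirrors the replace!(pdf.(…), NaN => 0) trick in the vectorized version.

function g!(G, β)
  fill!(G, zero(eltype(G)))
  η = X * β[1:size(X, 2)]
  ζ = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
  for i ∈ eachindex(η)
    k = y[i]
    F₀ = cdf(Logistic(), ζ[k] - η[i])
    F₁ = cdf(Logistic(), ζ[k + 1] - η[i])
    f₀ = k == 1 ? 0.0 : pdf(Logistic(), ζ[k] - η[i])   # pdf at -Inf is NaN, so guard it
    f₁ = pdf(Logistic(), ζ[k + 1] - η[i])              # pdf at +Inf is already 0
    p = F₁ - F₀
    # ∂(-ℓℓ)/∂β
    G[1:size(X, 2)] .+= wts[i] * (f₁ - f₀) / p .* X[i, :]
    # ∂(-ℓℓ)/∂ζ: upper cut point of category k, then its lower cut point
    k < length(ζ) - 1 && (G[size(X, 2) + k] -= wts[i] * f₁ / p)
    k > 1 && (G[size(X, 2) + k - 1] += wts[i] * f₀ / p)
  end
  G
end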

#12

This is a similar case, but one where auto-diff fails for the information matrix (hence the @test_broken below).

using Distributions: cdf, Logistic, pdf
using Optim: gradient, gradient!, hessian, hessian!, minimizer, optimize, TwiceDifferentiable
using Random: seed!
using Test: @testset, @test, @test_broken

seed!(0)
y = rand(1:4, 100)
X = hcat(rand(100), rand(0:1, 100))
wts = ones(100)

function f(β)
  cost = zero(Float64)
  η = X * β[1:size(X, 2)]
  ζ = vcat(-Inf, β[size(X, 2) + 1:end], Inf)
  for i ∈ eachindex(η)
    k = y[i]
    Y₀ = clamp(ζ[k] - η[i], -100, 100)
    Y₁ = clamp(ζ[k + 1] - η[i], -100, 100)
    F₀ = cdf(Logistic(), Y₀)
    F₁ = cdf(Logistic(), Y₁)
    cost -= wts[i] * log(F₁ - F₀)
  end
  cost
end

β₀ = vcat(zeros(size(X, 2)), log.(eachindex(unique(y))[2:end]))
td = TwiceDifferentiable(f, β₀)
β = optimize(td, β₀) |> minimizer
Ψ = inv(hessian!(td, β))

@testset "ℓℓ β̂ V̂" begin
  @test -f(β) ≈ -138.19529 atol = 1e-5
  @test β ≈ [-0.0752212, -0.0187382, -1.200432, -0.1276449, 0.8482401] atol = 1e-6
  @test_broken Ψ ≈ [0.44063026 0.02664232 0.23493455 0.23390820 0.23162855
                    0.02664232 0.13062293 0.08411162 0.08390074 0.08278606
                    0.23493455 0.08411162 0.21796978 0.18992484 0.17929260
                    0.23390820 0.08390074 0.18992484 0.20195565 0.18722638
                    0.23162855 0.08278606 0.17929260 0.18722638 0.20700211] atol = 1e-4
end