Inconsistency between standard errors from Hessian and Gradient in ForwardDiff/Optim

I have been using the Optim package with AD recently, and I am seeing some strange behaviour with the gradient. I am using Optim and NLSolversBase in the context of maximum likelihood estimation: constructing a TwiceDifferentiable from the objective, optimizing it, and then calculating the Hessian at the optimum gives correct standard errors. However, if I calculate the gradient and try to obtain the standard errors from its outer product, I get wildly incorrect results. Here is an MWE:

using Distributions, Optim, NLSolversBase, LinearAlgebra, DataFrames, GLM, Random
#Data Generating Process
Random.seed!(123)
x = rand(Normal(),5000)
epsilon = rand(Normal(),5000)
y = 1 .+ 2.5 .* x + epsilon .> 0

#Probit negative log-likelihood
function probit(theta, data)
    res = data[:,1:end-1]*theta        # linear index X*theta
    q = 2 .* data[:,end] .- 1          # map y in {0,1} to {-1,1}
    ll = cdf.(Normal(), q .* res)      # per-observation likelihoods
    LL = -sum(log.(ll))                # negative log-likelihood
end

#data and initial guess
thetap = [.5; 1.5]
datap = [ones(5000) x y]

#setup optimization
prob = TwiceDifferentiable(vars -> probit(vars, datap), thetap; autodiff = :forward)
prest = optimize(prob, thetap, Newton(), Optim.Options(show_trace=true))

#correct
a = sqrt.(diag(inv(hessian!(prob,Optim.minimizer(prest)))))

#no idea. I get a noninvertible matrix from the product of gradients.
gra = gradient!(prob,Optim.minimizer(prest))
b = sqrt.(diag(inv(gra*gra')))

#check GLM package
df = DataFrame(X =x, Y=y)
GLMprobit = glm(@formula(Y ~ X), df, Binomial(), ProbitLink())

Now, from MLE theory (the information matrix equality), the Hessian should be the same as, or at least similar to, the outer product of the gradients at the optimum. Am I missing something in the theory, or am I using the packages incorrectly?


gra as you calculate it here is the gradient of the objective function at the estimate, which is very close to a vector of zeros, as it should be. Thus, its outer product is (numerically) a matrix of zeros, which is why it is not invertible. To compute the covariance estimator, you need to average, over observations, the outer product of each observation’s contribution to the gradient. This is explained in econometrics.pdf in the mcreel/Econometrics repository on GitHub, page 488 and the several pages following.

An example file is https://github.com/mcreel/Econometrics/blob/master/Examples/MEPS-I/EstimatePoisson.jl, where you can explore the effect of using different covariance estimators by changing the vc option on the last line.

Methods for computing the covariance are here: https://github.com/mcreel/Econometrics.jl/blob/master/src/ML/mle.jl
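
For concreteness, here is a minimal sketch of that calculation. This is illustrative, not the code from the repository linked above; loglik_i (a function returning the N-vector of per-observation log-likelihood contributions) and thetahat (the ML estimate) are hypothetical placeholders, and ForwardDiff.jl is assumed to be available.

using ForwardDiff, LinearAlgebra
# scores: N x k matrix whose i-th row is s_i(thetahat), the gradient of
# observation i's log-likelihood contribution (loglik_i is a placeholder)
scores = ForwardDiff.jacobian(loglik_i, thetahat)
N = size(scores, 1)
B = scores' * scores ./ N          # average of the outer products of the scores
se_opg = sqrt.(diag(inv(B) ./ N))  # OPG standard errors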


I do not know the internals of Optim that well, but could it be that it returns the algorithm’s (e.g. BFGS) approximation of the Hessian instead?

Hi mcreel and Tamas, thanks for the input. I do understand the econometrics of MLE, and I realized that what Optim is returning is an estimate of the expected value of the gradient, i.e. N^{-1}\sum_i s_i(\hat{\theta}) \approx E[s(\theta)] = 0, whereas what I need is the average outer product of the scores, N^{-1}\sum_i s_i(\hat{\theta})s_i(\hat{\theta})', an estimate of E[s(\theta)s(\theta)'].

The examples that were given are only relevant for finite difference approximations and symbolic differentiation, not for Automatic Differentiation.

I understand that, given that I can use AD, there is not much added approximation cost in obtaining either the gradient or the Hessian. But in cases like partial/pooled MLE for panel data, where the estimator of the asymptotic variance cannot be simplified because of serial correlation, we need estimates of both the expected value of the Hessian and the outer product of the gradients. Should I rely on finite difference approximations in these cases?

For extremum estimators in general, you need to use the sandwich form, which uses both the Hessian and the covariance of the gradient contributions. You don’t have the simplification that happens with ML estimation, as you note. The Hessian is straightforward, but the way the covariance of the gradient contributions is computed depends on whether or not the contributions are serially correlated, as you also note. Nevertheless, these are theoretical issues. For computation of the estimated covariance matrix, the way the Hessian and gradients are obtained (analytically, by AD, or by finite differences) is a different issue. You could use any one of those three options, and you should get essentially the same numerical results, up to a bit of approximation error when using finite differences. When the data is properly scaled and the objective function is reasonably smooth, finite differencing (FD) is quite accurate, and the approximation errors are unimportant compared to the measurement accuracy of the data itself (usually).
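
For reference (this restates the standard textbook result, not a quote from the linked notes): writing s_i(\theta) for observation i's score and H_i(\theta) for its contribution to the Hessian of the negative log-likelihood, with \hat{H} = N^{-1}\sum_i H_i(\hat{\theta}) and \hat{B} = N^{-1}\sum_i s_i(\hat{\theta})s_i(\hat{\theta})', the sandwich covariance estimator is \widehat{Var}(\hat{\theta}) = N^{-1}\hat{H}^{-1}\hat{B}\hat{H}^{-1}. Under correct specification \hat{H} \approx \hat{B}, so the inverse Hessian form N^{-1}\hat{H}^{-1} and the OPG form N^{-1}\hat{B}^{-1} are also consistent for ML.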

Use of AD versus FD is probably of more importance during the optimization process itself, where better accuracy could lead to a quicker solution. But, I think that once the solution is found, the choice of AD versus FD for covariance estimation is not very important: there’s little time to be saved, and little accuracy to be gained. So, whatever’s more convenient would be my choice.
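
A quick sketch of that point, assuming ForwardDiff.jl and FiniteDiff.jl are installed, and reusing probit, datap, and prest from the MWE above:

using ForwardDiff, FiniteDiff
thetahat = Optim.minimizer(prest)
H_ad = ForwardDiff.hessian(t -> probit(t, datap), thetahat)                   # AD Hessian
H_fd = FiniteDiff.finite_difference_hessian(t -> probit(t, datap), thetahat)  # FD Hessian
maximum(abs.(H_ad .- H_fd))   # typically tiny relative to the entries of H_ad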

An interesting reference:


By the way, here’s your code, but also using the ML code I referenced above.

using Distributions, Optim, NLSolversBase, LinearAlgebra, DataFrames, GLM, Random
#Data Generating Process
Random.seed!(123)
x = rand(Normal(),5000)
epsilon = rand(Normal(),5000)
y = 1 .+ 2.5 .* x + epsilon .> 0

#Probit Function
function probit(theta, data)
    res = data[:,1:end-1]*theta
    q = 2 .* data[:,end] .- 1
    ll = cdf.(Normal(),q .* res)
    LL = -sum(log.(ll))
end


#Probit log-likelihood contributions (per observation, not summed)
function probit2(theta, data)
    res = data[:,1:end-1]*theta
    q = 2 .* data[:,end] .- 1
    ll = cdf.(Normal(),q .* res)
    LL = log.(ll)
end


#data and initial guess
thetap = [.5; 1.5]
datap = [ones(5000) x y]

#setup optimization
prob = TwiceDifferentiable(vars -> probit(vars, datap), thetap; autodiff = :forward)
prest = optimize(prob, thetap, Newton(), Optim.Options(show_trace=true))

#correct
a = sqrt.(diag(inv(hessian!(prob,Optim.minimizer(prest)))))
display(a)
#no idea. I get a noninvertible matrix from the product of gradients.
gra = gradient!(prob,Optim.minimizer(prest))
#display(gra*gra')
#b = sqrt.(diag(inv(gra*gra')))

#check GLM package
df = DataFrame(X =x, Y=y)
GLMprobit = glm(@formula(Y ~ X), df, Binomial(), ProbitLink())
display(GLMprobit)

# check mleresults() with the three covariance estimators
m = theta -> probit2(theta, datap)
mleresults(m, thetap, vc=1)   # sandwich form
mleresults(m, thetap, vc=2)   # inverse Hessian form
mleresults(m, thetap, vc=3)   # OPG form

Running this gives the following output. It appears that GLM must be using the inverse Hessian form for computing the covariance. The slight difference in results must be due to different convergence tolerances, different ways in which Hessians are computed, etc. But the numbers are the same for all practical purposes.

Note that the key to estimating the information matrix is to have the model available without the summing or averaging (the probit2 function).

julia> include("ML.jl");
Iter     Function value   Gradient norm 
     0     1.424836e+03     3.508441e+02
     1     1.272355e+03     1.255745e+02
     2     1.261903e+03     4.520994e+00
     3     1.261890e+03     4.146793e-03
     4     1.261890e+03     6.025036e-09
2-element Array{Float64,1}:
 0.036472878945357355
 0.06771710417344215 
StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial{Float64},ProbitLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + X

Coefficients:
             Estimate Std.Error z value Pr(>|z|)
(Intercept)   1.00567 0.0364882 27.5614   <1e-99
X             2.42153 0.0674308 35.9113   <1e-99

************************************************************
MLE Estimation Results    Convergence: true
Average Log-L: -0.25238   Observations: 5000
Sandwich form covariance estimator

                estimate     st. err      t-stat     p-value
           1     1.00567     0.03648    27.56426     0.00000
           2     2.42153     0.06568    36.87032     0.00000

Information Criteria
                   Crit.      Crit/n
       CAIC   2542.81478     0.50856
        BIC   2540.81478     0.50816
        AIC   2527.78039     0.50556
************************************************************
************************************************************
MLE Estimation Results    Convergence: true
Average Log-L: -0.25238   Observations: 5000
Inverse Hessian form covariance estimator

                estimate     st. err      t-stat     p-value
           1     1.00567     0.03647    27.57314     0.00000
           2     2.42153     0.06772    35.75961     0.00000

Information Criteria
                   Crit.      Crit/n
       CAIC   2542.81478     0.50856
        BIC   2540.81478     0.50816
        AIC   2527.78039     0.50556
************************************************************
************************************************************
MLE Estimation Results    Convergence: true
Average Log-L: -0.25238   Observations: 5000
OPG form covariance estimator

                estimate     st. err      t-stat     p-value
           1     1.00567     0.03647    27.57348     0.00000
           2     2.42153     0.06984    34.67112     0.00000

Information Criteria
                   Crit.      Crit/n
       CAIC   2542.81478     0.50856
        BIC   2540.81478     0.50816
        AIC   2527.78039     0.50556
************************************************************

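As a cross-check (a sketch, not part of the original post, assuming ForwardDiff.jl is added to the environment), the three covariance estimates can also be computed by hand from probit and probit2 above:

using ForwardDiff
thetahat = Optim.minimizer(prest)
H = ForwardDiff.hessian(t -> probit(t, datap), thetahat)     # Hessian of the negative log-likelihood (summed over observations)
S = ForwardDiff.jacobian(t -> probit2(t, datap), thetahat)   # 5000 x 2 matrix of score contributions
B = S' * S                                        # sum of outer products of the scores
se_invhess = sqrt.(diag(inv(H)))                  # inverse Hessian form
se_opg     = sqrt.(diag(inv(B)))                  # OPG form
se_sand    = sqrt.(diag(inv(H) * B * inv(H)))     # sandwich form

These should essentially reproduce the three mleresults() standard errors shown above, up to convergence tolerance and the way the derivatives are computed.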

Thanks mcreel, that is an interesting reference!