Optim.jl --- Do all Methods Allow Box Constraints? Should all Work Without Them?

I find that really hard to believe on a size 1000 problem. When you say it’s “spent in linesearch” are you sure that it’s not just because linesearch code calls your objective? What I’m getting at is that almost always, it’s the objective cost that dominates this stuff. Now, you may say it’s doing too many iterations, but maybe it’s just aiming for a higher accuracy than you’re interested in? See below.

What do you mean “with nested for loops”? Something like a grid search? I’m asking because I’m curious as to how you determine that your non-Optim solution is “fine”. Do you look at gradients, or are you eyeballing that it’s the same solution?

Also, if you’re using LBFGS but you’re not providing gradients, it’s defaulting to finite differencing which might be very slow here.

2 Likes

The cost defiitely takes time to call, I know that a cholesky factorization could speed it up by a factor of 2 but for the moment it just return an errorwhen I try, so i have to look into it.

And yes with nested loop I meant grid search. What I did was trying to reproduce the content of a paper, and one of the authors provided a python implementation of the algorithm where he does a grid search on approximately 12x12 points, but his code is quite slow when one wants to apply to a “real” analysis which is why I thought I could try it in Julia.

I’d like to provided the gradients, I think it is possible but it involves a bit of algebra and will cost me a lot of time.

Oh, sorry. I don̈́’t know why I thought it was a thousand dimensions. So you only have two parameters?

To help further, I’d need to see some more code. Feel free to share here or in private.

1 Like

I have a parameter that should be strictly positive. The unconstrained optimizer attempts negative values in the search, and I am trying to force it to focus on only nonnegative values by adding lower and upper bounds.
So I changed
optimize(θ -> loglike(θ, r, x, y, R, c, d, p0, B), lower, upper, theta, BFGS(), Optim.Options(g_tol = 1e-1, allow_f_increases = true, iterations=100_000, show_trace=true, show_every=1))
to
optimize(θ -> loglike(θ, r, x, y, R, c, d, p0, B), lower, upper, theta, Fminbox(GradientDescent()), Optim.Options(g_tol = 1e-1, allow_f_increases = true, iterations=100_000, show_trace=true, show_every=1))
The program runs, but the gradient seems to be oscillating so it never converges. Is this related to GradientDescent?

You don’t need lower and upper bounds to constrain a parameter like that.

If you want a parameter that’s always strictly positive, why not re-parametrize it? For example, if you want to estimate \sigma > 0, you can just let \sigma= exp(ln(\sigma)) instead. Now, call log(\sigma) = \gamma and estimate \gamma instead. Then with your estimated value of \gamma take exponentials to recover back \sigma. With this trick, you can simply use unconstrained optimization.

2 Likes

This is a nice idea and I indeed considered it. My only concern about it is that I am running a maximum likelihood estimation and would like to have standard errors for estimated parameters. Does the log transformation affect the interpretation of standard errors?

Yes, you can just use the delta method for calculating standard errors.

1 Like

re: the delta method, here’s an example: https://github.com/mcreel/Econometrics/blob/master/Examples/Restrictions/ExampleDeltaMethod.jl

1 Like

Oh you are right!
I redefined my positive parameters as their log versions. Now I run into an unexpected problem. Since I have a numerical integration in the objective function, when I call the optimization, I got

ERROR: TaskFailedException:
DomainError with 0.0:
integrand produced NaN in the interval (-4.0, 4.0)

My guess is that exp() sends some large attempted parameter values to Inf during the search?


Update: there is something weird about it. When I close VScode and restart again, the log-transformed program works fine. :wink:


Update: hmmm no, I still occasionally get the error, even after the optimization smoothly running for a while

Can you post your code? If it doesn’t take super long to run, I can try running it on my end.

1 Like

Thanks, this is very kind of you. Please see my code below:

using Pkg, Optim, GLM, Random, Statistics, Distributions, QuadGK, StatsFuns, ForwardDiff
Random.seed!(1234)

function choice_prob_naive(θ::Array{Float64,1}, r::Float64, x::Array{Float64,1}, y::Array{Float64,1}, R::Int64, c::Float64, d::Int64, p0::Float64)
    α = θ[1]
    βX = θ[2:5]
    βY = θ[6:10]
    γ0 = θ[11]
    γ1 = θ[12:16]
    logτ = θ[17] 
    logκ = θ[18]
    s(ϵs) = d + 1.0/sqrt(exp(logτ)) * ϵs                                           
    p(ϵs) = p0/(p0 + (1.0 - p0) * exp((0.5 - s(ϵs))*exp(logτ)))           
    Eu(ϵs)  = α*r + x'*βX + y'*βY + 0.05*α*R*r - ((γ0 + y'*γ1)*(1.0 - c) + 0.05*α*(1.0 + r - c))*p(ϵs) - (logκ - 1.0)        
    integrand(ϵs) = normcdf(0.0, 1.0, Eu(ϵs))*(1/sqrt(2.0*π)*exp(-(ϵs^2.0)/2.0)) 
    (expectation, err) = quadgk(ϵ -> integrand(ϵ), -4.0, 4.0, rtol = 0.0001)
    return expectation
end

function loglike(θ::Array{Float64,1}, r::Array{Float64,1}, x::Array{Float64,2}, y::Array{Float64,2}, R::Array{Int64,1}, c::Array{Float64,1}, d::Array{Int64,1}, p0::Array{Float64,1}, B::Array{Int64,1})
    prob = zeros(size(B,1))
    Threads.@threads for i = 1:size(B,1)    
        prob[i] = choice_prob_naive(θ, r[i], x[i,:], y[i,:], R[i], c[i], d[i], p0[i])
    end
    negloglike = - (B'*log.(prob) + (1.0 .- B)'*log.(1.0 .- prob))  
    return negloglike
end

td = TwiceDifferentiable(θ -> loglike(θ, r, x, y, R, c, d, p0, B), theta; autodiff = :forward);
@time theta_naive = optimize(td, theta, BFGS(), Optim.Options(f_tol=10^-3, iterations=100_000, show_trace=true, show_every=1))

The below creates a random sample. You can adjust the size with N, which is around 1 million in my actual sample.

N = 100000;
r = 400 .+ 200*randn(N);
x = [6 .+ 1*randn(N)  rand(Binomial(1,0.15), N) 0.6 .+ 0.6*randn(N) 7 .+ 0.5*randn(N)];
y = [0.5 .+ 0.2*randn(N) 2.8 .+ 2.2*randn(N) rand(Binomial(1,0.12), N)  7 .+ 8.0*randn(N) 1.3 .+ 6.0*randn(N)];
R = rand(Binomial(1,0.1), N);
c = 0.5 .+ 0.2*randn(N);
d = rand(Binomial(1,0.03), N);
p0 = 0.5 .+ 0.2*randn(N);
B = rand(Binomial(1,0.3), N);

# initial value
alpha = 0.001;betaX = [0.02; 0.1;0.01;0.02]; betaY = [0.05; -0.01; -0.1; 0.25; 0.11]; gamma0 = 0.05; gamma1 = [0.1; -0.05; -0.0; -0.1; -0.1]; tau = 1; kappa = 2; 
theta = [alpha; betaX; betaY; gamma0; gamma1; tau; kappa]

I also checked the NaN issue. It seems to occur when the search attempts very small value of parameters. For example:

choice_prob_naive(-5000*theta, r[1], x[1,:], y[1,:], R[1], c[1], d[1], p0[1]) 
ERROR: DomainError with 0.0:
integrand produced NaN in the interval (-4.0, 4.0)

Update: I found the issue is that when τ approaches zero, for example when logτ = -1000, exp(logτ) returns exactly zero, instead of a tiny positive value. In this case, s(ϵ) becomes Inf and p(ϵ) becomes NaN.
Not sure if there is a good solution to this issue. I tried to add

    if logτ < -100
        logτ = -100
    end

This change made the program much slower.

I tried copy-pasting and running although it gave me some errors, but tracking through the updates it seems like the issue is gone (albeit, with slower code)? And, how slow are we talking about? 2x, 10x, more?

The code should run after deleting type annotations.
It’s like 10x slower with the if condition, so this solution it not very satisfying.

Hmm, I feel like there should be like an exponential-log-trick here

http://gregorygundersen.com/blog/2020/02/09/log-sum-exp/

But, I don’t see it immediately. It’s a common way to bound denominators of potentially unstable expressions in likelihood away from 0. Maybe something like this works? Also, see section 3.2 here for a more practical example, with importance sampling.

This will cause a type instability. Try with -100.0, check if that works better.

1 Like

I replaced -100 with -100.0, but did not notice much difference