How can I use Optim/BFGS on functions I can't evaluate but whose gradients I can compute?

Suppose I have a function I want to maximize with Optim’s BFGS algorithm. While I can compute gradients for this function, I cannot compute its value (this is typical if e.g. a high-dimensional determinant is involved, since gradients of a determinant reduce to a trace, but evaluating the determinant itself is \mathcal{O}(N^3)).

BFGS with a fixed step-size (i.e. no line search) should be fine, since it never needs the value, only the gradients. I can set a fixed step-size with BFGS(alphaguess = LineSearches.InitialStatic(alpha=0.1), linesearch = LineSearches.Static()), but how do I tell Optim not to try to use the target function?

I’ve tried just having it return 0, but that seems to terminate the optimization early. Any suggestions for how I can do this? Thanks.

Have you considered a root-finding method to solve ∇f=0? e.g. the analogue of BFGS for root-finding is Broyden’s method.
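
To make that concrete, here’s a minimal hand-rolled sketch of “good” Broyden applied to g(x) = ∇f(x) = 0. The dense inverse-Jacobian estimate and the name broyden_solve are purely illustrative; for a large problem you’d want a limited-memory or matrix-free variant, or a package implementation. Note that no objective values are ever evaluated, only gradients.

using LinearAlgebra

function broyden_solve(g, x0; iters=1_000, tol=1e-8)
    x = copy(x0)
    H = Matrix{Float64}(I, length(x), length(x))  # inverse-Jacobian estimate
    gx = g(x)
    for _ in 1:iters
        norm(gx) < tol && break
        Δx = -H * gx                  # quasi-Newton step: no objective value needed
        x += Δx
        gx_new = g(x)
        Δg = gx_new - gx
        # Sherman–Morrison form of the "good" Broyden update
        H += (Δx - H * Δg) * (Δx' * H) / (Δx' * H * Δg)
        gx = gx_new
    end
    return x
end

broyden_solve(x -> 2x, ones(10))  # toy check: ∇(xᵀx) = 2x has its root at 0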

Have you tried just setting a negative tolerance for the objective function?

I’ve implemented Broyden’s method in the form of an “updateable” linear operator as part of ProximalAlgorithms: it’s actually the modified Broyden method by Powell (1970), which ensures that the operator is always invertible. You can check this test for an example of how to use it; it’s very easy, especially if no line search is needed.

However, I guess if the gradient is C^1 then its Jacobian is symmetric, and BFGS could work better.

Also, just out of curiosity, can you expand on this? To compute gradients of determinants, you need adjugates, which presumably are also \mathcal{O}(N^3)?

That’s a good point, thanks; I suppose the problem I’m really solving is ∇f=0. I wasn’t as familiar with Broyden’s method, but looking it over, the updates to the Hessian (at least as described on Wikipedia) are not exactly the same as what you’d get with BFGS on the equivalent problem. I’ll have to think about the different options (and thanks for the link @lostella). Fwiw, my problem is indeed C^1.
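
For reference, the standard textbook inverse-update formulas (writing s for the step, y for the gradient difference, and H for the inverse Jacobian/Hessian approximation) show the difference:

H \leftarrow H + \frac{(s - Hy)\, s^\top H}{s^\top H y} \qquad \text{(good Broyden)}

H \leftarrow \left(I - \frac{s y^\top}{y^\top s}\right) H \left(I - \frac{y s^\top}{y^\top s}\right) + \frac{s s^\top}{y^\top s} \qquad \text{(BFGS)}

Broyden’s update makes no use of symmetry, whereas BFGS preserves symmetry and positive definiteness of H.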

I did, but it still terminated early. The only solution I’ve found so far is to artificially have the target function return a continually decreasing sequence, so it never triggers the termination criteria. Something stupid like:

f₀ = 0
optimize(
    x -> (global f₀ -= 1),       # dummy objective: strictly decreasing, never converges
    x -> my_actual_gradient(x),  # the real gradient
    x₀, BFGS(...), inplace=false
)

Yeah, sorry, I needed to say more for that to make sense. You have that,

\frac{d}{dx} {\rm logdet}\, A(x) = {\rm tr}(A^{-1} \frac{dA}{dx})

In my case A is a positive-definite N \times N implicit linear operator which can be applied in \mathcal{O}(N \log N), and similarly for the application of its gradient. You can do the trace with N_{\rm MC} Monte-Carlo samples using Hutchinson’s method, and you can do the inversion with N_{\rm CG} conjugate-gradient iterations, so the whole thing is \mathcal{O}(N_{\rm MC} N_{\rm CG} N \log N).

Given my N \sim 10^7, it works out that you can compute the logdet gradient this way, but the \mathcal{O}(N^3) determinant is hopeless (I’ve not found any stochastic determinant estimates that work here, although I’d be interested to hear suggestions!). I think this type of thing can arise often when you’re maximizing a Bayesian posterior and A is some high-dimensional covariance operator.
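
To make that concrete, here’s a rough sketch of the estimator, with cg_solve and logdet_grad as made-up names and a dense matrix standing in for the implicit operator (only matvecs are actually needed):

using LinearAlgebra

# Plain conjugate gradient for A*y = z (A symmetric positive definite);
# A only needs to support `*`, so an implicit operator works too.
function cg_solve(A, z; iters=200, tol=1e-8)
    y = zero(z); r = copy(z); p = copy(r)
    rs = dot(r, r)
    for _ in 1:iters
        Ap = A * p
        α = rs / dot(p, Ap)
        y += α * p
        r -= α * Ap
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p = r + (rs_new / rs) * p
        rs = rs_new
    end
    return y
end

# Hutchinson estimate of tr(A⁻¹ dA): average zᵀA⁻¹(dA z) over Rademacher
# probes z; one CG solve and one application of dA per sample.
function logdet_grad(A, dA; n_mc=32)
    N = size(A, 1)
    acc = 0.0
    for _ in 1:n_mc
        z = rand((-1.0, 1.0), N)
        acc += dot(cg_solve(A, z), dA * z)  # ⟨A⁻¹z, (dA)z⟩, using A = Aᵀ
    end
    return acc / n_mc
end

Each sample costs one CG solve plus one application of dA, which is where the \mathcal{O}(N_{\rm MC} N_{\rm CG} N \log N) count above comes from.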

You could use the stochastic Lanczos quadrature (SLQ) method to estimate the log determinant (= trace of log A).

Another nice thing about SLQ is that you can use the same Krylov bases to simultaneously compute your gradient trace, instead of using CG.
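
In case it’s useful, a rough sketch of SLQ along those lines (hand-rolled Lanczos with full reorthogonalization for simplicity; lanczos, slq_logdet, and the m/n_mc knobs are illustrative, and A again only needs matvecs):

using LinearAlgebra

# m-step Lanczos tridiagonalization of A starting from unit vector q1,
# with full reorthogonalization for numerical safety.
function lanczos(A, q1, m)
    N = length(q1)
    Q = zeros(N, m); α = zeros(m); β = zeros(m - 1)
    Q[:, 1] = q1
    for k in 1:m
        v = A * Q[:, k]
        α[k] = dot(Q[:, k], v)
        v -= Q[:, 1:k] * (Q[:, 1:k]' * v)  # full reorthogonalization
        k == m && break
        β[k] = norm(v)
        β[k] < eps() && return SymTridiagonal(α[1:k], β[1:k-1])
        Q[:, k+1] = v / β[k]
    end
    return SymTridiagonal(α, β)
end

# SLQ estimate of logdet(A) = tr(log A): for each probe, the Lanczos
# tridiagonal gives Gauss-quadrature nodes θᵢ (its eigenvalues) and
# weights τᵢ² (squared first components of its eigenvectors).
function slq_logdet(A; n_mc=32, m=50)
    N = size(A, 1)
    est = 0.0
    for _ in 1:n_mc
        z = normalize(rand((-1.0, 1.0), N))  # normalized Rademacher probe
        F = eigen(lanczos(A, z, m))
        est += sum(F.vectors[1, :] .^ 2 .* log.(F.values))
    end
    return N * est / n_mc
end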

If you are using low-accuracy stochastic methods to estimate the function and gradient, you might want to use a stochastic optimization method like Adam. BFGS is not very robust to errors.
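
For what it’s worth, Adam is only a few lines to hand-roll if you want to avoid a framework dependency. A minimal sketch (standard update with the usual default hyperparameters; written for minimization, so pass the negated gradient to maximize):

# Hand-rolled Adam: `grad` returns a (possibly noisy) gradient;
# no objective values are ever needed.
function adam(grad, x0; η=1e-3, β1=0.9, β2=0.999, ϵ=1e-8, iters=10_000)
    x = copy(x0)
    m = zero(x)   # first-moment (mean) estimate
    v = zero(x)   # second-moment (uncentered variance) estimate
    for t in 1:iters
        g = grad(x)
        m = β1 * m .+ (1 - β1) * g
        v = β2 * v .+ (1 - β2) * g .^ 2
        m̂ = m / (1 - β1^t)             # bias correction
        v̂ = v / (1 - β2^t)
        x = x .- η * m̂ ./ (sqrt.(v̂) .+ ϵ)
    end
    return x
end

adam(x -> 2x, ones(10))  # toy check: drives |x|² toward its minimum at 0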

For Optim, maybe try the allow_f_increases option? If that doesn’t work, report it as a bug.

That’s a very fun problem you have. So you compute a sum of \langle u, A^{-1} \frac{dA}{dx} u \rangle = \langle A^{-1} u, \frac{dA}{dx} u \rangle terms with CG, where the number of CG solves is independent of the size of x. Clever! Not sure if you can play the same trick with SLQ?

Conceptually you can swap out CG for the semi-iterative Chebyshev method (which shouldn’t even be too inefficient if you’re in very large dimension anyway), which turns that into \langle p(A) u, \frac{dA}{dx} u \rangle where p is a fixed polynomial. But even that has no nice antiderivative, because the matrices don’t commute. Fun!
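
If anyone wants to play with that idea, here’s a sketch of the Chebyshev iteration given spectral bounds (chebyshev_solve is a made-up name, and λmin/λmax are assumed known; the coefficient recurrences are the ones in Saad’s Iterative Methods for Sparse Linear Systems, Alg. 12.1, so double-check before relying on them). The point is that after k steps the iterate is p_k(A) b for a polynomial fixed by the bounds and k alone, unlike CG’s solution-dependent polynomial:

function chebyshev_solve(A, b, λmin, λmax; iters=100)
    θ = (λmax + λmin) / 2   # center of the spectral interval
    δ = (λmax - λmin) / 2   # half-width
    σ = θ / δ
    ρ = 1 / σ
    y = zero(b)
    r = copy(b)             # residual for the zero initial guess
    d = r / θ
    for _ in 1:iters
        y += d
        r -= A * d
        ρ_new = 1 / (2σ - ρ)
        d = ρ_new * ρ * d + (2ρ_new / δ) * r
        ρ = ρ_new
    end
    return y
end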

Static is sort of an afterthought, so we don’t handle this. I do get your point, though… Can you post the actual error? That’ll make it easier to find. We’ll have to check for that line search in the lines that update the function and gradient.

Here’s what I checked, minimizing |x|^2 with a fixed step-size of 0.1. There’s no error, but if I just have my target function return e.g. 0, you can see that it terminates early even with a negative f_tol:

using Optim, LineSearches, LinearAlgebra

bfgs = BFGS(
    alphaguess = LineSearches.InitialStatic(alpha=0.1), 
    linesearch = LineSearches.Static()
)

# do a proper minimization
optimize(
    x->dot(x,x),
    x->2x,
    ones(10),
    bfgs,
    inplace=false
).iterations # returns 181

# attempt to always return 0 from the target function,
# but with correct gradient
# this doesn't work and will terminate early
optimize(
    x->0,
    x->2x,
    ones(10),
    bfgs,
    Optim.Options(f_tol=-1, allow_f_increases=true),
    inplace=false
).iterations # returns 2

# the only workaround I've found 
f₀ = 0
optimize(
    x->(global f₀-=1),
    x->2x,
    ones(10),
    bfgs,
    inplace=false
).iterations # returns 181

Okay, perhaps try Optim.Options(f_abstol=-1, f_reltol=-1).