Understanding the results from Optim.jl for a black box function

I’m using Optim.jl to solve an unconstrained minimization problem.
In this particular problem I have a black-box function, which can take a long time on a single function evaluation. I don’t have access to gradient information, and even though I have tried to use automatic differentiation, there are some parts of the code that the differentiator cannot handle and throws some errors.

Nevertheless, I’m using the L-BFGS implementation from Optim.jl with the finite differences approximation for the gradient information, and it seems to work fine. Unfortunately, I cannot show the full code here because it is not my property, but I can at least show a snippet of the code that does the optimization call

closure(x) = sqerror_model(x; phi=phi, rho=rho, model=model) # The objective function, a black-box
initial_x = randn(dim)
optim_res = Optim.optimize(
    closure, initial_x, Optim.LBFGS(; m=15), Optim.Options(iterations=200, show_trace=true)
)

With this snippet I’m trying to point out that there are 200 iterations and that I want to see the trace for each iteration. I also changed to m parameter for the optimization method to get a little bit more precision.

What I get in return is the following

Here, the dimension=3001 is the size of my parameter vector. This is the expected length of my minimizer.

I have some questions from this.

  1. What does the gradient norm equal to zero mean?
  2. I specified 200 iterations on the optimization call, why do I only see one iteration, namely the zeroth iteration?
  3. Why do I see zero on all the convergence measures? And also, what does the NaN mean in this convergence measures?
  4. At the end, it says that only one call to both the function and the gradient were done. Does this mean that only one iteration was done?

That looks fishy. Your screenshot indeed suggests that only the initial point is evaluated, then the method terminates because the gradient is 0 (= it is a stationary point).
Since you’re picking a random initial point, looks like the gradient is not evaluated correctly (if at all). NaN (not a number) probably indicates that, since you stopped at the first iteration, there’s nothing to compare to.

To see a trace at each iteration, you could put some printing instructions in your “closure” function, ie something like:

function closure(x)
  evaluation = sqerror_model(x; phi=phi, rho=rho, model=model)
  println("Current x: ", x)
  println("Current evaluation: ", evaluation)
  return evaluation
end
3 Likes

Thanks for the reply.
I used your suggested code, although not with the printing of the current value of x because it is very large, but with the current evaluation I get the feeling that the function is being evaluated a lot of times, I guess that’s because of the gradient estimation that’s taking place.

Anyway, the value is always the same as seen here

Current evaluation: 0.13872807013353833
Current evaluation: 0.13872807013353833
Current evaluation: 0.13872807013353833
Current evaluation: 0.13872807013353833
Current evaluation: 0.13872807013353833
Current evaluation: 0.13872807013353833
Current evaluation: 0.13872807013353833
Current evaluation: 0.13872807013353833

I have now used an initial value of zero everywhere, instead of a random initial point.
Maybe this does mean that this is a stationary point and no optimization is being done.
I’ll have to try with a different optimizer to check if the current evaluation changes.

This looks like a line search: BFGS computes a descent direction based on the gradient and tries to determine what fraction along this direction it should move to. But since the gradient is 0, it just stays where it is.

Does BFGS use finite differences by default to approximate the gradient?
Have you tried finite differences on your function to see what comes out of it? For example, you could try evaluating \frac{f(x + h e_1) - f(x)}{h} where e_1 = (1, 0, \ldots, 0) is the first unit basis vector, and h is around 10^{-3}. Do you get something nonzero?

Yes, Optim.jl uses finite differences to approximate the gradient. I’m using this as well because the automatic differentation does not work on my function.

I’ll test this out soon and let you know, thanks for all the help!

1 Like

I implemented the following solution, very slow and improvised

function gradient_test(phi, rho, model, dim)
    function closure(x)
        evaluation = sqerror_model(x; phi=phi, rho=rho, model=model)
        println("Current evaluation: ", evaluation)

        return evaluation
    end

    h_val = 1e-5
    for i in eachcol(I(dim))
        init_value = randn(dim) .- 50.0 # Change the evaluation every time!
        gradient_finite = closure(init_value .+ (h_val .* i)) - closure(init_value)
        gradient_finite /= h_val
        display(gradient_finite)
    end

    return nothing
end

and obtained the following

Current evaluation: 75.8385823602473
Current evaluation: 75.8385823602473
0.0
Current evaluation: 75.8385823602473
Current evaluation: 75.8385823602473
0.0
Current evaluation: 75.8385823602473
Current evaluation: 75.8385823602473
0.0
Current evaluation: 75.8385823602473
Current evaluation: 75.8385823602473
0.0

Here are some of my thoughts:

  • Even if I change the input vector init_value and use random values in each iteration, the function evaluation returns the same value, which is the reason why the gradient is zero. I don’t think this is intended, nor I believe it is right. Both evaluations, f(x + h e) and f(x) should be different, at least to some extent.
  • Indeed, the gradient is zero and that’s the reason why the optimizer was not doing anything at all.

I really don’t know how or why this is happening, but at least I got some useful insights that need further investigation.

What does h_val .* i do? Is it really the i th unit basis vector?

Yeah, if I loop over each column from I I get the boolean version of an identity matrix, which I believe is equivalent to the i th unit basis vector.

julia> using LinearAlgebra

julia> for i in eachcol(I(5))
       println(i)
       end
Bool[1, 0, 0, 0, 0]
Bool[0, 1, 0, 0, 0]
Bool[0, 0, 1, 0, 0]
Bool[0, 0, 0, 1, 0]
Bool[0, 0, 0, 0, 1]

and I could convert that to Float64 with the following

julia> for i in eachcol(I(5))
       println(Float64.(i))
       end
[1.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 1.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 1.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 1.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 1.0]

and when I do h_val .* i in the previous post, I get the value of he_i in the expression f(x + he_i), where e_i is the i th unit basis vector.
Well, hopefully I didn’t make a mistake somewhere, but I think it looks right.
Anyway, both versions give me the same results.

Cool, looks good.

You get evaluations different from 0.13872807013353833 (in your second message), so the function is not totally constant. But it’s weird that its value changes so little. Can you try with larger values of h?

I plugged in h=0.1 and, unfortunately, the result is almost the same.

Current evaluation: 75.6876240505038
Current evaluation: 75.6876240505038
0.0
Current evaluation: 75.6876240505038
Current evaluation: 75.6876240505038
0.0
Current evaluation: 75.6876240505038
Current evaluation: 75.6876240505038
0.0
Current evaluation: 75.6876240505038
Current evaluation: 75.6876240505038
0.0
Current evaluation: 75.6876240505038
Current evaluation: 75.6876240505038
0.0
Current evaluation: 75.6876240505038
Current evaluation: 75.6876240505038
0.0
Current evaluation: 75.6876240505038
Current evaluation: 75.6876240505038
0.0

Yeah, you’re totally right, very sorry about that. That’s because I had to change some other parameters within the code. If I stick with the same ones, I get something similar to those values obtained before

Current evaluation: 0.12281985076905867
Current evaluation: 0.12281985076905867
Current evaluation: 0.12281985076905867

I was changing the parameters because I need this procedure to be quite general. Those values only represent that some parameters give better estimations than others, but the problem is still the same, it keeps evaluating the same value, regardless of initial condition or any other thing, which is really troublesome. But that’s part of the numerical procedure itself, I don’t think it has something to do with the optimization part.

Got it.
Can you explain what the x values represent for your function? For example, are they parameters for a model that you need to fit against data points?
Can you check in the body of the function how these x values are used (and maybe post a small snippet to illustrate it)?

Unfortunately, I can’t post any more code.

I can say, though, that those x values are actually neural network weights, and the idea is to try to find the weights that give the minimum value for the black box function.
The way I’m trying to accomplish this is by evaluating some weights, and adjusting them using the L-BFGS optimization method. But if the gradient is zero, I’m getting nowhere, as you can already tell.

I guess the real problem lies within the black-box function, because regardless of the vector that I pass to the function I always get the same values in return.

So check that your function actually updates the weights of the NN from the x vector passed as parameter. Something in sqerror_model must be wrong.

1 Like

Yeah, there’s something very wrong there, I’ll make sure to check it out thoroughly.
Anyways, thanks for all the help and time, I really appreciate it!

1 Like

Without derivatives probably you should try a derivative-free method, like bobyqa. Numerical derivatives can be very expensive and very unstable.

Or, from Optim.jl, Nelder Mead - Optim.jl

1 Like

Thanks, I’ll try that out!