Comparison of derivative-free / black-box optimization algorithms in Julia

Is anyone aware of a comparison between derivative-free / black-box optimization libraries in Julia? If not, perhaps it is worth starting one here and finally putting it on the Julia Package Comparisons website?

At the moment I am not sure which one to use: Optim.NelderMead or BlackBoxOptim. I couldn't find a comparison in either package's documentation. Meanwhile, more and more packages keep popping up; e.g., I just found out about PRIMA. And Optimization.jl wraps even more!
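To make the comparison concrete, here is the kind of head-to-head I have in mind, as a minimal sketch on the Rosenbrock function (the search range and evaluation budget are arbitrary placeholder choices):

```julia
using Optim, BlackBoxOptim

# Rosenbrock: a smooth test function with a narrow curved valley
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# Local, derivative-free simplex search from a starting point
res_nm = Optim.optimize(rosenbrock, [0.0, 0.0], NelderMead())

# Global, population-based search over a box
res_bbo = bboptimize(rosenbrock; SearchRange = (-5.0, 5.0),
                     NumDimensions = 2, MaxFuncEvals = 10_000)

@show Optim.minimum(res_nm) best_fitness(res_bbo)
```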

Note that my question has two aspects: the scientific one (limitations, stability, etc.) and of course computational performance (given a problem that all algorithms can solve, which one solves it faster). @ChrisRackauckas shared with me this repo for a performance comparison: jonathanBieler/BlackBoxOptimizationBenchmarking.jl (which unfortunately needs to be updated and rerun).

In any case, it is probably useful to have a comparison in JuliaPackageComparisons, so let's use this thread to make it! (Yes, I love that website!!!)


The tl;dr is, as jonathanBieler/BlackBoxOptimizationBenchmarking.jl shows, BlackBoxOptim.jl "generally" does best for global optimization. There aren't enough benchmarks out there right now, but PRIMA.jl "generally" does best for local derivative-free optimization. Metaheuristics.jl is also generally good, though it might need more tweaks, and there's some good stuff in NLopt.jl and Optim.jl. So in general, using the Optimization.jl interface and trying a bunch of black-box optimizers will be required to find what's best for a given problem.
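To illustrate that last point, here's a rough sketch of swapping solvers through the common interface (the solver choices below are examples, not recommendations):

```julia
using Optimization, OptimizationBBO, OptimizationOptimJL

rosenbrock(u, p) = (p[1] - u[1])^2 + p[2] * (u[2] - u[1]^2)^2
u0 = zeros(2)
p = [1.0, 100.0]

# BlackBoxOptim backend: global search, requires box bounds
prob_box = OptimizationProblem(rosenbrock, u0, p;
                               lb = [-1.0, -1.0], ub = [1.0, 1.0])
sol_bbo = solve(prob_box, BBO_adaptive_de_rand_1_bin_radiuslimited())

# Optim backend: local derivative-free search from u0
prob = OptimizationProblem(rosenbrock, u0, p)
sol_nm = solve(prob, NelderMead())
```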

Just a historical note: Optimization.jl was first called GalacticOptim, as a pun on its being a global interface for global optimizers. Choosing a global optimizer is not generally easy, so the purpose of the interface was to pull together all of the global optimizer packages into a single suite, making it easy to build package-agnostic interfaces at this level. Of course, Optimization.jl then grew to do lots of other things, especially with local optimizers, and now it's rather general. That said, I'd highly recommend it over hardcoding a single global optimizer, because in our experience every global optimizer has different merits; that's exactly how we ended up here.

But anyways, what you're really looking for is a solver recommendation page like Nonlinear System Solvers · NonlinearSolve.jl that describes when one should use each method within Optimization.jl. The reason the Optimization.jl documentation is still missing this page is that we simply do not have enough benchmarks to make good recommendations right now. We are working on some new global optimizers (PSOGPU.jl) and, with that, setting up a whole set of benchmarks for SciMLBenchmarks.jl, so hopefully by this summer we should have a recommendations page backed by benchmarks, like we have in the other SciML packages. Until then, we just have the API docs per package and let people go ham, because :sweat_smile: who knows what to recommend.


I wish GalacticOptim was the name kept :cry:


@pkofod also commented on Slack:

The history of some of these is "interesting". For example, the collective "we" had an extremely poor understanding of when Nelder–Mead works and when it doesn't for a long time, yet it has survived for more than half a century as the default in many packages (though I think it should be replaced in many cases by a DFO trust-region alternative, but those are less well-known and harder to implement). "We" understand a bit more now, but theoretically it's not the most well-founded approach. Still, it often "works" and doesn't require gradients → many people will accept it.
That said, I would not even consider PSO and NM in the same aisle of the optimization supermarket, though I know some people think that NM does OK at avoiding bad local minima if you start the simplex large enough (and maybe restart it with rotations). Then again, any algorithm that you allow to visit many objective values will do quite well, especially if you can polish the results with local optimizers. So… :slightly_smiling_face:

I dunno, it only tried a subset of the algorithms in NLopt; e.g., it didn't try DIRECT-L or CRS2, or my favorite, MLSL combined with a local optimizer (ideally gradient-based; it requires some tuning of the local-search tolerance), and for some reason they continue to use Nelder–Mead. And, of course, how well these algorithms do is very problem-dependent — you are well advised to experiment with a few algorithms on your own problem.
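For concreteness, a minimal sketch of that MLSL + local-optimizer combination in NLopt.jl; the bounds, budget, and tolerances are placeholder values, and the local xtol_rel is exactly the knob that needs tuning (here BOBYQA serves as a derivative-free local polisher, in keeping with this thread's topic):

```julia
using NLopt

# Objective in NLopt's (x, grad) convention; grad is unused by
# derivative-free local searches such as BOBYQA
myobj(x, grad) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

opt = Opt(:G_MLSL_LDS, 2)        # MLSL with low-discrepancy sampling
opt.lower_bounds = [-5.0, -5.0]
opt.upper_bounds = [5.0, 5.0]
opt.min_objective = myobj
opt.maxeval = 10_000

local_opt = Opt(:LN_BOBYQA, 2)   # local search run from each MLSL start
local_opt.xtol_rel = 1e-6        # local-search tolerance: tune this
opt.local_optimizer = local_opt

minf, minx, ret = optimize(opt, [0.0, 0.0])
```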

Note also that there is a vast difference between optimizing a smooth function with a few dozen or a few hundred local optima and optimizing a function with a few thousand or a few million local optima — basically, there is an algorithmic tradeoff in how much effort is spent on global vs. local search. (Likewise between optimizing smooth vs. non-smooth functions, though it's not clear whether they include the latter here.) Algorithms that perform well in one case may perform terribly in another. Ideally the benchmark would classify the problems a bit more rather than lumping them all together.

For local derivative-free optimization of smooth functions (at least once or twice differentiable), I usually prefer one of the Powell algorithms (I’ve used the older ones in NLopt, based on an f2c translation of Powell’s original code, but it is reportedly even better to use the more modern implementations in PRIMA). I don’t know why people bother with Nelder–Mead (which may not converge to a local optimum, and has no real support for constraints) in this century, at least for smooth functions.
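For example, a sketch of what this looks like with PRIMA.jl (the objective and bounds are illustrative only):

```julia
using PRIMA

f(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# NEWUOA: Powell's model-based method for unconstrained problems
x1, info1 = PRIMA.newuoa(f, [0.0, 0.0])

# BOBYQA: the bound-constrained variant
x2, info2 = PRIMA.bobyqa(f, [0.0, 0.0];
                         xl = [-5.0, -5.0], xu = [5.0, 5.0])
```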


Agreed it needs to be better, which is why I want to get a version onto the SciMLBenchmarks and start to maintain it ourselves. There are many tweaks I want to do to it. But for now it’s probably the best benchmark set I have to point to.

I like the idea of the TikTak algorithm from TikTak — Fatih Guvenen (that page also links to a paper benchmarking it against other methods and finding that it works well).

@Tamas_Papp started an early implementation in MultistartOptimization.jl, but it needs work to be considered feature-complete (parallelization, for example).
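For reference, a usage sketch along the lines of the MultistartOptimization.jl README (the number of Sobol points and the local method are tuning choices):

```julia
using MultistartOptimization, NLopt

f(x) = sum(abs2, x .- 0.5)    # toy objective; substitute your own

P = MinimizationProblem(f, fill(-2.0, 4), fill(2.0, 4))  # lower & upper bounds
local_method = NLoptLocalMethod(NLopt.LN_BOBYQA)         # local polisher
multistart_method = TikTak(100)                          # 100 Sobol starting points
result = multistart_minimization(multistart_method, local_method, P)
result.location, result.value
```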

My guess: it's the default in everything from R to Python to Matlab. And if you're working outside such a system, it's entirely possible for a half-decent undergrad to program it up, while most grad students would have a hard time implementing Powell's algorithms or the methods from the "DFO trust region" books. By contrast, the gradient-based methods, including line-search and trust-region methods, seem much more approachable because you do not have to do the same hand-holding of the objective model function to keep its numerical properties usable. In the "gradient exists" case, BFGS is much more forgiving and can always be reset if your updates are not accepted for a few iterations.

Edit: So that is why it's the only local-search derivative-free method in Optim, for example: unlike NLopt or Optimizers.jl, my objective is not to provide all sorts of algorithms, but to have a self-contained system of optimizers in Julia code under the MIT license. It may be stupid not to just support PRIMA, for example, but I am quite open to adding it as an extension. That way people can use the Optim.jl interface if they like it.

PRs are most welcome.

That said, parallelization only benefits the first stage; the second one is not parallelizable (as is; maybe the algorithm could be modified). I found that using threads in the evaluation of the objective (whenever possible) saturates the CPUs pretty well, so I did not pursue parallelization in the algorithm per se.
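To illustrate, a toy version of that pattern, with a made-up simulate_criterion standing in for the expensive model solve:

```julia
using Base.Threads

# Hypothetical stand-in for an expensive model solve
simulate_criterion(θ, draw) = sum(abs2, θ .- draw)

# The optimizer stays serial; the CPUs are kept busy inside each
# objective evaluation by threading over the simulation draws.
function objective(θ, draws)
    vals = Vector{Float64}(undef, length(draws))
    @threads for i in eachindex(draws)
        vals[i] = simulate_criterion(θ, draws[i])
    end
    return sum(vals) / length(draws)
end

draws = [randn(3) for _ in 1:1_000]
objective(randn(3), draws)
```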

Also, I have become very skeptical of optimizing hairy functions in the social sciences, especially in the context of estimation (SMM or ML). Having gazillions of local optima is usually a signal that the model is a bad fit to the data (cf. the folk theorem of statistical computing). Instead of throwing a sophisticated black-box optimizer at it, IMO it is generally better to start over: pare the model down to a smaller one that is well understood, estimate, add one part, estimate, and so on. That way I trust the final outcomes much more and may know which local optima are irrelevant.

Also, in economics, models are not exogenously given (as opposed to, say, an engineering problem), so we have a lot of room to modify them. Making an objective AD-able is usually the best investment.


Smoothing seems to be the standard approach, yes. Enough smoothing can of course get rid of all local minima besides the global one :smiley: I agree that more than a few minima is suspicious in a statistical setting. You may have identification in the sense of having a global minimum, but you should be worried about that identification relying on knife-edge differences in data or assumptions. I find the methods in something like dynamic Markov games quite brittle, and the conclusions should be viewed in that light. If you're in a physics or engineering setting where you don't have to worry about all those local minima because they are not relevant and you're just looking for the global one: fine. If you're in a dynamic Bertrand game… eh… you probably need to qualify your problem a bit better if you have 10 million equilibria in a two-player game and you want to make market predictions for the book market in Chile :slight_smile:


It might be useful to have a look at the original benchmark:

Latest results:

The description of the functions:

Note that I never implemented the last 4 functions; I ran out of steam at the 20th.


I like the multistart approach for its simplicity and reliability. It can also be used with AD-able objective functions. As you are already aware, the fact that an objective function can be differentiated has very little to do with how easy it is to find the global minimum.

Mathematically, of course, you are correct; the two are orthogonal. But in practice, at least in the social sciences, both may be consequences of your problem not being well-defined.

Results from convex problems are usually the most convincing; mild violations of convexity are OK as long as one can explore the domain. But at the other end of the spectrum, if you need a sophisticated optimizer to find the best model fit among 10^6 local optima, each close to the others (on a log scale) but corresponding to wildly different parameter values, then chances are that you are just fitting the wrong model, and understanding what is going on would be a better investment of time than optimizing.
