I’ve been comparing the performance of Optim.jl with Matlab’s fminunc. Surprisingly, Optim's L-BFGS algorithm doesn’t always beat fminunc. I was wondering if anyone knows why this might be.
Here’s the set up:
Use a single processor for each language (i.e. load Matlab with -singleCompThread option; don’t use parallelization in Julia)
Generate identical data sets [up to random noise] for a common maximum likelihood problem: the normal linear model
Provide the analytical gradient of the obj. function to fminunc; use autodiff with Optim
Estimate the maximum likelihood parameters using identical starting values [up to random noise]
Compare the timing of the estimation under various scenarios optimization options, or various starting values
I’m using Julia 0.6.0 and Matlab R2015a on a RHEL Linux machine with an Intel Core i-7 processor with 8 CPUs
The complete code for both examples is available as a gist here.
Some of my observations:
Optim performs much better than fminunc when the starting values are “good” (i.e. close to the solution) but much worse when the starting values are “bad.” I’m surprised by this, given that this objective function is about as well-behaved as one could possibly be.
Does anyone know why Optim seems to be more efficient of an optimizer than fminunc, but less robust? What I mean by that is that fminunc can find the solution given pretty much any feasible set of starting values, whereas Optim takes much longer to find the solution, or sometimes doesn’t find the solution at all, given poor (but feasible) starting values.
To clarify: is there any performance gain to Optim by specifying the gradient by hand rather than using autodiff? My reading of the docs suggests that there isn’t, but I wanted to make sure.
Are there any other options or non-default settings I should know about that will improve performance? I didn’t see BackTracking() listed anywhere in the docs, so it would seem that I could be missing something major.
You may also want to try other optimizers (search direction generators), not just L-BFGS, I found the ConjugateGradient to be very often very fast and robust for some of my problems.
If you can find the Hessian of your objective function or some approximation of it, even just the diagonal, you may also want to use that approximate Hessian as a preconditioner in L-BFGS or ConjugateGradient which take a preconditioner https://julianlsolvers.github.io/Optim.jl/stable/#algo/precondition/. This will increase the cost of computing a search direction but it will improve its quality so you can benefit from spending more time (or function evaluations) searching along this expensively computed search direction, which may get you closer to the local optimal solution.
At the end of the day, only experimenting will show you the right path, I am only giving general advice.
Yes there is. In theory with a great implementation of reverse-mode differentiation there shouldn’t, but in practice we are not there yet. In particular, the default autodiff uses forward-mode differentiation, which is intrinsically slower. If the docs are imprecise there, feel free to point it out (or even better submit a PR!)
Regarding options: play around with LBFGS’s history size, and the line searches (BackTracking is the best choice for LBFGS in my experience).
I’m flattered (on behalf of all the contributors https://github.com/JuliaNLSolvers/Optim.jl/graphs/contributors ), but Optim is a project started by, then grad student, John Myles White, and later development and maintenance has been continued by myself with great help from other Julia community members (some have responded here), especially Asbjørn Riseth (anriseth). However, I think it’s always a good idea to keep in mind that if fminunc was found to be significantly slower/worse than Optim, MathWorks would actually be able to spend money on improving it. Unfortunately, we don’t have that possibility beyond gsoc (yet?).
There are quite a few things to be careful with here.
First, generate the data once and estimate your coefficients in Julia and Matlab using the same data. Why? In what you show, the -loglikelihood is lower in Optim than in Matlab, but is that because the data is different? It seems like a big dataset, but you really need to compare optimization attempts on the same objectives to say anything for sure.
Second, as already mentioned, be sure to compare things that are claiming to do the same things! (L-)BFGS can behave differently than a Newton’s method using trust regions.
Third, as already mentioned, the line search choices is very important. We’ve been working quite a bit on it in the spring, and hopefully we can provide some further insights over the coming year. Help is welcome!
Fourth, it seems like a whole lot of allocations are being made. Could this be from ForwardDiff somehow? Was finite differencing (the default) slow since you changed it?
I have never really compared Optim to Matlab or SciPy. Partially because I don’t really use those tools myself. Happy to contribute to such investigations, and certainly happy to learn and improve!
This is more likely due to me not coding very well in Julia . I never did use the finite differencing because I used the analytical gradient in Matlab and wanted to compare ForwardDiff with analytical.
Thanks also for all of your other tips, and for offers to help. My background is not in computer science, and I taught myself coding in Matlab (meaning my coding habits are poor), so I don’t know if I have much to contribute. But I’d be glad to pitch in if there’s anything I can do. I spend a lot of time optimizing, so I can definitely be a tester of new offerings if nothing else.
My main idea that prompted this thread was that, because Julia is so much lighter and faster than Matlab, it would smoke Matlab at anything. But Matlab’s environment is also heavily optimized with calls to compiled C/C++/Fortran objects, etc., so in the end I’m learning that the advantage is not so clear.
Well there’s a few things, but yes algorithms + efficient code is what you need to get top notch speed. Autodifferentiation is going to cost about 2x of a function cost since you have to propagate both both the real and the dual parts, so using the analytical solution is just a better algorithm. Also, MATLAB’s fminunc is probably more optimized right now, so less f and gradient calls are probably necessary. It’ll get optimized over time, that’s the beauty of open source, but right now it’s probably just comparable. You can also try NLopt.jl which is a more established optimization package where you can use the Julia function: effectively giving the best of both worlds for now.
I don’t think that’s the main problem: it’s more forward vs reverse at that point.
It’d be very interesting to get data on that. When testing against another library, I’ve found Optim with LBFGS/backtracking to be pretty competitive. If there are significant discrepancies in terms of f/g calls wrt matlab, that’d be useful to know to try and improve it.
We need real data. Work precision plots of runtime vs accuracy and function calls vs accuracy on a few problems. Anything else is pretty much junk data since there are infinite alternative explanations.
This is what I was trying to say above. Comparing forward diff with hand coded gradients, maximum likelihood on two different data sets, lbfgs vs newton w/ trust regions… It’s not a very clean comparison, so there’s not much to conclude. I’m not picking on OP, just agreeing that if we need to make any comparisons, we need to be careful.
I don’t have a matlab license right now, but I’d be happy to comment and cooperate on such a comparison.