I raised an issue at How to configure Optim.jl's LBFGS to have similar parameters to scipy.optimize's L-BFGS-B? · Issue #922 · JuliaNLSolvers/Optim.jl · GitHub.
Glad to see you’re making progress on finding various parameters of the algorithm that account for differences here.
One question I have: you describe the two implementations as producing similar results, but it seems like Optim produces -6.263329e+00 and Python produces -6.453688614845538. Those seem quite different to me – so different than some of the tolerance settings don’t seem relevant.
Adding another line search algorithm that has an acceptable license and produces better results seems like a great PR.
At this point you could profile to check whether most of the time is spent in your functions or in the optimization code itself.
As you can see here: https://gist.github.com/PharmCat/c6dee71211da6cb5221f1f0896a75d11
ForwardDiff works on your data. Yes - this is a bad solution, but it works. So you can try to change your method for storing temporary/initial data or you can specify different function only for gradient calculation with ForwardDiff.
Also you can try Optim.ConjugateGradient()
and Optim.BFGS()
Does Optim.jl also implement the newer modifications referenced at the start of the fortran code used by scipy? That paper is not referenced by Optim
I forgot to mention that the RNG for the Julia and Python version are different, and so the result are slightly different. The reason why the solutions are almost identical is that the randomness contributes only to fluctuation on top of an exponentially decaying trend for the cost of the green energy, and also just fluctuation of a mean-reversion trend for the cost of the brown energy.
Raised a tracking issue for this: Implement scipy.optimize's dcsrch (a modified version of Minpack2's code) · Issue #155 · JuliaNLSolvers/LineSearches.jl · GitHub.
I measured this for HagerZhang(linesearchmax = 20)
with LBFGS. Given that 1 function evaluation takes 28.514 μs, that the optimization takes 689.765 ms, and that there are 384 function calls for one optimization, then the function evaluations take 1.58% of the entire optimization.
Thank you very much for showing that it is possible to use ForwardDiff for my model!
I tested it, and it took 549.220 ms for an optimization, which is 21.6% faster. If the line search slowness can be solved, this would be blazing fast.
I have seen how you made the type generic, I will incorporate it into my model, since it doesn’t reduce readability / comprehensibility of the code at all.
Here generic implementation for ForwardDiff with Optim.ConjugateGradient()
Optim method.
The separate Firm structure used to store values. Gradient function provided from ForwardDiff.
Maybe it can be done more efficiently, but it leads to rethinking data structures …
Edit: you have many coefficients with the range 0.0 - 1.0, I think it may be a problem for line search. You can try to use the “sigmoid” function to link any input values with your range, and optimize it without limits. Also, this approach makes it possible to use the Newton method - it will take a lot of memory for hessian, but it can be more efficient.
In reasonably nice objectives I find that backtracking with quadratic interpolation is very efficient, even though it technically doesn’t satisfy Wolfe and therefore gives no convergence guarantee. The line search docs should help you figure out how to use that line search. I known it is supported
Have you tried to see if mkl works better? I wasn’t sure if this algorithm is Iinear algebra intensive. You never know, and mkl frequently dominates openblas (though by less if you tinker with thread settings). Matlab, for example, frequently beats Julia for performance in linear algebra intensive code, and after eliminating the usual suspects like globals and abstract types it almost always is because of MKL
Thank you for the improvement over the autodiff code!
This is the time with vanilla Optim.ConjugateGradient()
using your code:
742.269 ms (23957 allocations: 4.21 MiB)
The time with Optim.Fminbox(Optim.ConjugateGradient(linesearch = LineSearches.HagerZhang(linesearchmax = 20))), Optim.Options(g_tol = 1e-5, f_tol = 2.2e-9)
using your code:
651.240 ms (20708 allocations: 3.66 MiB)
The time using LBFGSB Optim.Fminbox(inner_optimizer(linesearch = LineSearches.HagerZhang(linesearchmax = 20))), Optim.Options(g_tol = 1e-5, f_tol = 2.2e-9)
:
485.330 ms (13076 allocations: 2.60 MiB)
i.e. 30.7% faster than finite difference.
The reason I chose LBFGS was because in the Python version, it has the most reliable result for my model.
To recap:
- The fastest 100% Julia method so far: 485.330 ms
- The fastest method so far (Julia model, but uses scipy.optimize via PyCall as suggested by @jovansam ): 14.093 ms
@antoine-levitt actually recommended using LineSearches.BackTracking
(How to configure Optim.jl's LBFGS to have similar parameters to scipy.optimize's L-BFGS-B? · Issue #922 · JuliaNLSolvers/Optim.jl · GitHub). It’s slower than LineSearches.HagerZhang
.
Hmm, no, I don’t use any vectorized nor linalg operations in my model.
Sorry to be late to the party. Unfortunately, you are not using L-BFGS-B in Optim, you’re using a much less efficient barrier method. In NLSolvers.jl I have a projected newton method, but since you’re already using finite differences for your gradients, this might be a very bad idea. It shouldn’t be too bad to adjust it to work with a BFGS-approximation though.
So I think this is three things: you’re using a suboptimal solver (“my fault”), you’re using finite differences (“your problem’s fault”), and your problem is not differentiable (it’s a simulation). I think the last two are actually worse for the barrier method than for l-bfgs-b, but let that be a conjecture.
If I look at a result I get
julia> result.minimizer
30-element Vector{Float64}:
1.62658413314516e-10
0.9999999998063619
0.999999999921528
0.9999999999434931
0.9999999999526344
0.9999999999586152
0.9999999999615918
0.9999999999635981
0.9999999999650679
0.999999999966145
0.9999999999668981
0.9999999999676399
0.9999999999681027
0.999999999968442
0.9999999999688338
0.9999999999689775
0.9999999999690062
0.9999999999691217
0.9999999999690876
0.9999999999689795
0.9999999999687171
0.9999999999682864
0.9999999999677559
0.9999999999668858
0.9999999999658892
0.9999999999643274
0.9999999999623169
0.9999999999596437
0.9999999999558721
0.999999999950147
this is not good for the barrier method. My guess is that the solution is at the boundary (one is 0 the rest are 1) and L-BFGS-B quickly lands exactly on a face of the constraint “box”. I 100% intend to implement L-BFGS-B in NLSolvers.jl but I think I might do projected BFGS first.
What is the latest version of the code here? (possibly with the gradient computed as suggested above)? I would like to try the method I implemented here, but actually running the example (extracting what is the function, initial point, and the gradient) is too complicated for me.
As you can see here https://gist.github.com/PharmCat/c6dee71211da6cb5221f1f0896a75d11
Newton works with sigmoid function and works fast.
But I think there are some issues in your optimization problem, I think the hypersurface is very flat near the critical point and it is very difficult to find the minimum. With some sigmoid modification, you can get exactly the same results as in Py, but the solution looks unstable.
Did you try NLopt? Or you want to use only L-BFGS method?
Well, for the sake of information. I tested the Optim LBFGS in comparison with the SPGBox method in a simpler example, and got the following, to reach the same solution:
lbfgs: 15.924 ms (8020 allocations: 9.00 MiB)
spgbox: 44.655 μs (12 allocations: 37.25 KiB)
julia> function run()
f(x) = sum(sin(v) for v in x)
x0 = rand(100)
lower = zeros(length(x0)) .- 0.5
upper = ones(length(x0))
g! = (g,x) -> ForwardDiff.gradient!(g,f,x)
r_optim = optimize(f, g!, lower, upper, x0, Fminbox(LBFGS()))
println(r_optim)
r_spgbox = spgbox!(f, g!, x0, lower=lower, upper=upper)
println(r_spgbox)
@test r_optim.minimizer ≈ x0
print("lbfgs:"); @btime optimize($f, $g!, $lower, $upper, x0, Fminbox(LBFGS())) setup=(x0=rand(100)) evals=1
print("spgbox:"); @btime spgbox!($f, $g!, x0, lower=$lower, upper=$upper) setup=(x0=rand(100)) evals=1
nothing
end
run (generic function with 1 method)
julia> run()
* Status: success
* Candidate solution
Final objective value: -4.794255e+01
* Found with
Algorithm: Fminbox with L-BFGS
* Convergence measures
|x - x'| = 1.71e-06 ≰ 0.0e+00
|x - x'|/|x'| = 3.43e-07 ≰ 0.0e+00
|f(x) - f(x')| = 0.00e+00 ≤ 0.0e+00
|f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
|g(x)| = 1.71e-10 ≤ 1.0e-08
* Work counters
Seconds run: 0 (vs limit Inf)
Iterations: 3
f(x) calls: 883
∇f(x) calls: 883
SPGBOX RESULT:
Convergence achieved.
Final objective function value = -47.942553860420375
Best solution found = [ -0.5, -0.5, -0.5, ..., -0.5]
Projected gradient norm = 0.0
Number of iterations = 2
Number of function evaluations = 3
lbfgs: 15.924 ms (8020 allocations: 9.00 MiB)
spgbox: 44.655 μs (12 allocations: 37.25 KiB)
ps: SPG is a very simple algorithm, and that implementation is in pure Julia. If anyone wants to contribute to making it really available to others, you are all welcome. That implementation is a simple translation from another Fortran code, probably there are some things that can be written better in a more Julian style. Breaking changes in the API can be done, no package is currently depending on that.
Thanks for mentioning SPG, I didn’t know about this method, and it having a Julia package.
It is a popular method, particularly for how simple it is to implement, and also for being very memory-efficient. It is also family business, that method was originally proposed by my father and collaborators (here) .