Wow, seems like Optimization.LBFGS is seriously broken…
Yes, it’s not registered yet. That’s part of the “but we need to finish and document it” caveat. We already use it in some projects, for example in the GPU kernel solvers project, but it just needs a few finishing touches.
It is getting pulled out to be an extension, like all other solvers. Most of the work is already done (New Subpackage for LBFGS by ParamThakkar123 · Pull Request #986 · SciML/Optimization.jl), but merging that is breaking, so the whole big Optimization break (i.e. move the right things to OptimizationBase.jl with no solvers, change the preferred BFGS to be a proper native one with the right bells and whistles, remove some legacy stuff, etc.) will need to come all at once.
Note that it’s just the classic Fortran LBFGS-B. The fact that it’s a Fortran code is why it also doesn’t support Reactant tracing.
[Now if you’re thinking, SciML never makes a wrapped code a default solver and never misses type checking like that… yeah, this is why I have been saying Optimization.jl is in desperate need of clean up. It needs to be moved out of being a privileged auto-installed solver (which isn’t something we do anywhere else; the basic library is always solver-independent), it should be named OptimizationLBFGSB so it mirrors the Fortran code’s name, we should make sure there is a proper core solver with full generic type support and benchmark it to death, etc. So yes, this breaks a lot of standard SciML idioms; it’s known, and pointing fingers won’t fix it, I just need to find the time this fall to make Optimization.jl more like the other packages… but until then, yes, this is a little quirk I’ll need to root out. This is the next thing on my mind for after the dependency reduction / precompile improvements to OrdinaryDiffEq.jl / DifferentialEquations.jl… so more on that soon.]
As far as I can see, SimpleOptimization.LBFGS will also actually not work right now if I presupply Reactant-compiled gradients to OptimizationFunction (since instantiate_gradient is only defined for AutoForwardDiff and AutoEnzyme, and not for NoAD), right? Is there a nice way around this or should I just be patient on this front? I’m relatively happy with ReverseDiff + Optim.LBFGS right now, but Reactant compilation would be really helpful, since right now I have GBs of memory usage just from Lux model applications inside my loss function.
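For concreteness, here is a minimal sketch of what I mean by presupplying a gradient, using the documented OptimizationFunction keyword interface. The loss and compiled_grad! below are placeholders standing in for my real loss and the Reactant-compiled gradient closure; whether the NoAD path accepts this is exactly the question above.

using Optimization, OptimizationOptimJL, Optim, SciMLBase

# Placeholder loss and a stand-in for a Reactant-compiled in-place gradient
loss(u, p) = sum(abs2, u)
compiled_grad!(G, u, p) = (G .= 2 .* u)

# What I'd like: opt out of AD entirely and hand over my own gradient
optf = OptimizationFunction(loss, SciMLBase.NoAD(); grad = compiled_grad!)
prob = OptimizationProblem(optf, rand(3))

# With Optim.LBFGS this should go through the user-supplied grad; the native
# LBFGS path is where the missing NoAD method for instantiate_gradient bites.
sol = solve(prob, Optim.LBFGS())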
Also @avikpal, have you had a chance to look at the pure Enzyme situation here? It still seems weird to me that there’s no way to differentiate Lux model applications with respect to the parameters with Enzyme right now without hitting runtime activity or having to use Reactant.
Really weird discovery I just made: the runtime activity with Enzyme + Lux disappears entirely if the point at which the model is evaluated is of type Vector{Int64} (I also quickly tested Vector{Int32}; that also works). If the point is a Vector{<:AbstractFloat}, I get the runtime activity error. I assume there’s some dispatch weirdness going on somewhere.
Quick reproduction:
using Lux, Random, Enzyme

model = Dense(2 => 1)
ps, st = Lux.setup(Random.default_rng(), model)

# Int64 input: runs with no problem
Enzyme.gradient(Reverse, only ∘ Lux.LuxCore.stateless_apply, Const(model), Const([0, 0]), ps)

# Float32 input: fails because of runtime activity
Enzyme.gradient(Reverse, only ∘ Lux.LuxCore.stateless_apply, Const(model), Const([0f0, 0f0]), ps)
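If it helps, a possible workaround sketch (assuming a recent Enzyme.jl where set_runtime_activity is available): explicitly opting in to runtime activity should let the Float32 call go through, at some performance cost, so it’s a stopgap rather than a fix for the underlying dispatch weirdness.

# Workaround sketch: enable runtime activity instead of erroring on it
# (assumes Enzyme.set_runtime_activity exists in the installed Enzyme.jl version)
Enzyme.gradient(Enzyme.set_runtime_activity(Reverse),
                only ∘ Lux.LuxCore.stateless_apply,
                Const(model), Const([0f0, 0f0]), ps)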