Well, a few things. This is, like another thread going right now, calling for SimpleOptimization.jl. It's not registered yet, but there is an LBFGS in there:
and if you use that, then you can make your u0 a static array, and a quasi-Newton method will converge much faster than GradientDescent. But secondly, you could then use this with KernelAbstractions.jl to GPU-parallelize over different parameters. It would look just like the example of doing this with (Simple)NonlinearSolve.jl:
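In the meantime, here's roughly what the SimpleOptimization.jl side could look like. Take this as a hedged sketch, not confirmed API: the package is unregistered and undocumented, so the solver name (`SimpleLBFGS` below) and whether it differentiates the objective internally (ForwardDiff-style, like SimpleNonlinearSolve does) are my assumptions; the problem/solve pattern just follows the standard SciML interface.

```julia
# Sketch only: SimpleOptimization.jl is unregistered, so the solver name
# (SimpleLBFGS) and its gradient handling are assumptions, not confirmed API.
using SimpleOptimization, StaticArrays

# Objective with tunable parameters p (Rosenbrock as a stand-in)
rosenbrock(u, p) = (p[1] - u[1])^2 + p[2] * (u[2] - u[1]^2)^2

u0 = SA[0.0, 0.0]    # static array initial guess: stack-allocated, no heap,
p  = SA[1.0, 100.0]  # which is what makes the GPU kernel idea feasible
prob = OptimizationProblem(rosenbrock, u0, p)
sol  = solve(prob, SimpleLBFGS())  # quasi-Newton, much faster than GradientDescent
```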
I know this at least has worked because it’s one half of this work:
https://openreview.net/pdf?id=nD10o1ge97
I.e., ParallelParticleSwarms.jl has a hybrid algorithm that first runs an asynchronous particle swarm for many steps, and then finishes the global optimization by doing an LBFGS from every particle as a multi-start type of method. Because of this, we know you can GPU-parallelize the SimpleOptimization.jl LBFGS kernel, since that is exactly how the last step is done. It's currently not documented, so use at your own risk etc.; I plan to get that stuff documented and released later this fall. But since it already has benchmarks and paper examples, it's already at least usable if you're willing to give it a shot.
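If you want to try that kernel approach before the docs land, the shape of it would be something like the following. Again a sketch under assumptions: the kernel body uses the same assumed SimpleOptimization.jl API as in the sketch above, and running it on a GPU additionally requires the solve to be fully non-allocating (which is the property the ParallelParticleSwarms.jl finishing step relies on). On a GPU you'd pass device arrays and e.g. `CUDABackend()` instead of `CPU()`.

```julia
# Sketch: parallel multi-start LBFGS via KernelAbstractions.jl,
# one independent solve per work-item, each with its own u0 and p.
using KernelAbstractions, StaticArrays
using SimpleOptimization  # unregistered; API assumed as in the sketch above
# using CUDA              # for CUDABackend() and CuArray inputs

rosenbrock(u, p) = (p[1] - u[1])^2 + p[2] * (u[2] - u[1]^2)^2

@kernel function multistart_lbfgs!(results, @Const(u0s), @Const(ps))
    i = @index(Global, Linear)
    prob = OptimizationProblem(rosenbrock, u0s[i], ps[i])
    results[i] = solve(prob, SimpleLBFGS()).u
end

n   = 1024
u0s = [@SVector(rand(2)) for _ in 1:n]  # different starting points
ps  = [SA[1.0, 100.0] for _ in 1:n]     # or different parameters per solve
results = similar(u0s)

backend = CPU()  # swap for CUDABackend() with CuArrays to run on GPU
multistart_lbfgs!(backend, 64)(results, u0s, ps; ndrange = n)
KernelAbstractions.synchronize(backend)
```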