With SIMD the CPU may execute the same instructions on e.g. 4x differents datasets just as fast as on 1x dataset.

Assuming the user has a SIMD-enabled objective function - are there optimization algos that can exploit that? E.g. in Optim, Roots, SciML…?

The prime example would be the BiSection algorithm. Instead of halving the search space in every iteration, a SIMD-enabled BiSection could cut the search space to 1/4th (or 1/8th) in every iteration - allowing for exponentially faster convergence.

Another example could be the Newton algo that could optimize simultanously on 4 (or 8) different initial values - and return when one path converges.
Of course, this would require the user has both SIMD-enabled objective function and SIMD-enabled derivative function.

The tools don’t have an interface for it yet. But I was just chatting with @Elrod about doing this just last week. We’ll probably start doing some of this for all SciML problems once the GPU push is done this summer (there’s new algorithms in each area to be made for this). If you’re interested and want to help out, the SIMD-bisection is exactly the place to start and I can get the interface stuff together (needs just a batching interface)

I am not involved with SciML optimization solvers, but I have some thoughts about multi-starting Newton solver with SIMD parallelism:

As you pointed out, we would need SIMD-enabled objective or derivative functions, and we would also need a linear solver (either sparse or dense) that can take advantage of SIMD parallelism. I’m not aware of CPU implementations, but for GPUs, there are batched factorization routines for dense matrices. Exploiting SIMD parallelism for the solver’s internal computation wouldn’t be too difficult as they are mostly broadcast, map, mapreduce-type of operations.

In the case of constrained problems, sophisticated implementations of the second-order method have additional tricks that may lead to divergence in the computation pattern (e.g., inertia correction, second-order correction, restoration phase, etc.) These wouldn’t be compatible with the synchronous execution model of SIMD, so if we want to approach it with SIMD parallelism, some extra caution might be needed. In my opinion, multi-threaded parallelism would be a better approach here.

In our project MadNLP.jl, we have thought about the idea of a batched nonlinear solver, where a bunch (>100) of relatively small (~100 vars) optimization problems with the same structure are solved simultaneously. In theory, this can be implemented, as there are SIMD-enabled AD packages exist (e.g., ExaModels.jl) and batched linear solvers are available in CUDA, and like mentioned above, exploiting SIMD parallelism for the solver’s internal computation should be straightforward. But we haven’t yet found attractive enough applications to justify the development cost