Conceptual Question on GPU Integration with Optimization.jl

This is more of a conceptual question about CUDA integration (CUDA.jl) with Optimization.jl.

Does the CUDA integration speed up both compilation AND solving time, or just solving time?

Does Optimization.jl support CUDA integration?

Note that Optimization.jl does not implement the solution algorithms. It just forwards the problem to a solver backend. In most cases, these solvers do not support GPUs.
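
For concreteness, here is a minimal CPU-only sketch of that hand-off (the Rosenbrock objective and the LBFGS choice are just illustrative): Optimization.jl describes the problem, while the algorithm itself comes from a backend package such as OptimizationOptimJL, which wraps Optim.jl.

```julia
using Optimization, OptimizationOptimJL

# Optimization.jl only defines the problem; LBFGS below is Optim.jl's
# implementation, reached through the OptimizationOptimJL wrapper.
rosenbrock(u, p) = (p[1] - u[1])^2 + p[2] * (u[2] - u[1]^2)^2

optf = OptimizationFunction(rosenbrock, Optimization.AutoForwardDiff())
prob = OptimizationProblem(optf, zeros(2), [1.0, 100.0])
sol  = solve(prob, LBFGS())
```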

If Optimization does support CUDA and there is a solver which exploits GPUs, then I would expect the compilation time to be a little slower (because there is more work to do) and the solving time to be highly problem-dependent. In most cases, GPUs do not improve the performance of optimization algorithms. (As one exception, see https://github.com/sshin23/ExaModels.jl.)

Yes, it's used in a lot of examples throughout the ecosystem, for example:

https://docs.sciml.ai/NeuralPDE/stable/tutorials/gpu/

Note that we do have a set of solvers coming out soon that will use GPUs in some nice ways. That's about 3 months out though, aiming for before JuliaCon.

Generally it depends, but usually the compilation time is a little higher since most of the time there are at least some kernels to build due to broadcast. But there's some work in Julia v1.11 on being able to cache more of this, so that should help in the future.

Extremely dependent on the problem. Generally you need to have some kind of O(n^2) or O(n^3) behavior for it to make sense. Matrix multiplications (neural networks), LU-factorizations (stiff ODE solves), or something of the sort. If it's a bunch of O(n) operations I wouldn't expect magic.
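
As a rough illustration of that scaling argument (a hypothetical micro-benchmark, not tied to any particular optimizer): an O(n^3) dense matrix multiply is the kind of operation that benefits from the GPU, while a single O(n) broadcast mostly pays kernel-launch and transfer overhead.

```julia
using CUDA, BenchmarkTools

n = 2048
A, B   = rand(Float32, n, n), rand(Float32, n, n)
Ad, Bd = cu(A), cu(B)

# O(n^3): dense matmul, where GPUs shine
@btime $A * $B
@btime CUDA.@sync $Ad * $Bd

# O(n): one elementwise broadcast, where the win is much smaller
x, xd = rand(Float32, n^2), cu(rand(Float32, n^2))
@btime $x .+ 1f0
@btime CUDA.@sync $xd .+ 1f0
```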

But note that there are effectively 3 different ways that GPUs can be used here, and we shouldn't conflate them.

  1. There's doing GPU operations in the user's f, where all the optimizer really needs to do is ensure that its operations keep the state on the GPU (to reduce memory transfer overhead), and thus the core aspect of performance is whether your f is suitable for GPUs. This is what you would do for very large state vectors, like the weight vector of a deep neural network (see the sketch after this list).
  2. There's solving many optimizations simultaneously on GPUs. Basically DiffEqGPU.jl (https://github.com/SciML/DiffEqGPU.jl), but for Optimization.jl. While technically (1) can be used to implement this (make a block-sparse optimization problem), in practice that's slow. In the DiffEqGPU paper we showed that this is how PyTorch and JAX tend to do things, and it's about 20x-100x slower than building custom kernels (https://www.sciencedirect.com/science/article/abs/pii/S0045782523007156). We have some tooling coming out soon for doing similar things to DiffEqGPU but for nonlinear solving and optimization. For now this is only basic algorithms (BFGS), on problems small enough to fit in GPU registers (so state vectors of size about 100 or so max), for the purpose of solving a parameterized problem over many parameters really fast (a plain CPU sketch of that workload is at the end of this post).
  3. Using GPUs as part of the optimization process itself, i.e., taking a serial optimization and making it more parallel. This is where we have what I think is our coolest new project taking place, and I'll withhold details until the results are public. But basically, this is also a case where the problem wouldn't be too big (state vector at most in the hundreds) but the optimization problem is hard.
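
For (1), a minimal sketch of the pattern, assuming a CUDA-capable setup and using a toy least-squares objective in place of something like a neural network loss: put the initial state and parameters on the GPU as CuArrays, and the solver's iterates stay there.

```julia
using Optimization, OptimizationOptimisers, CUDA, Zygote

# Toy objective; the point is only that u and p are CuArrays, so the
# optimizer's state and updates remain on the GPU.
loss(u, p) = sum(abs2, u .- p)

u0 = cu(zeros(Float32, 10_000))   # state vector on the GPU
p  = cu(rand(Float32, 10_000))    # data/parameters on the GPU

optf = OptimizationFunction(loss, Optimization.AutoZygote())
prob = OptimizationProblem(optf, u0, p)
sol  = solve(prob, OptimizationOptimisers.Adam(0.01), maxiters = 100)
```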

So again, more details on (2) and (3) are coming very soon (it's been one of the big Julia Lab projects since the DiffEqGPU paper work was completed), but for now you can find the tutorials for doing (1). Whether that is useful is largely dependent on context, as described above.
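
In the meantime, the kind of parameter sweep that (2) targets can be written today as a plain serial loop over problems (a minimal CPU sketch with a made-up objective; the upcoming GPU tooling will use custom kernels rather than anything resembling this loop):

```julia
using Optimization, OptimizationOptimJL

# Hypothetical parameterized objective: minimize over u for each parameter set p.
loss(u, p) = (p[1] - u[1])^2 + p[2] * (u[2] - u[1]^2)^2
optf = OptimizationFunction(loss, Optimization.AutoForwardDiff())

# Sweep over many parameter sets, solving one small problem per set.
params = [[1.0, 100.0 * k] for k in 1:100]
sols = map(params) do p
    prob = OptimizationProblem(optf, zeros(2), p)
    solve(prob, BFGS())
end
```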