# ExaModels and MadNLP: Solving Large-Scale Nonlinear Programs on GPUs

We are pleased to announce that ExaModels v0.6 and MadNLP v0.8 now have full support for solving large, sparse, constrained nonlinear programs on NVIDIA GPUs up to high accuracy. In this posting, we will go over the key features of ExaModels and MadNLP, with some benchmark results showcasing these capabilities.

TL;DR: with ExaModels and MadNLP on GPU, one can solve large-scale nonlinear programs significantly faster than the state-of-the-art alternative tools on CPUs. Below demo is comparing ExaModels + MadNLP + cuDSS vs ExaModels + Ipopt + ma27 for solving AC OPF case13k problem. Team CPU is on the left, and team GPU is on the right.

## Introduction

### What Type of Problems can ExaModels and MadNLP solve?

We target large-scale, sparse, constrained nonlinear optimization problems of the following form:

```
min f(x)
s.t. x_L <= x <= x_U
g_L <= g(x) <= g_U,
```

where `f`

and `g`

are twice continuously differentiable. Furthermore, we target problems with sparse constraint Jacobian and Lagrangian Hessian. Examples of such problems include AC optimal power flow problems, optimal control (with direct transcription), state/parameter estimation problems (with direct transcription), discretized PDE-constrained optimization, etc.

Solution algorithms for these problems have been previously considered to be incompatible with GPU due to the need for sparse indefinite factorization (LBL^T) and the sequential nature of numerical pivoting employed within the factorization algorithm. However, thanks to the recent progresses, we now can solve these problems on GPUs efficiently. If you want to know more about the details of these methods, please check our recent publication.

**What are the state-of-the-art tools?**

We typically need three software pieces to solve such problems: **an automatic differentiation tool** (for evaluating first/second order sensitivity of `f`

and `g`

), **an optimization solver** (for applying optimization solution algorithms, such as interior point method or sequential quadratic programming), and **a sparse, symmetric, indefinite linear solver** (for computing Newtonâ€™s step directions).

The state-of-the-art software tools for each of these tasks include:

- Optimization solver:
`Ipopt`

(open-source),`Knitro`

(commercial), etc. - Automatic differentiation:
`JuMP`

(open-source),`CasADi`

(open-source),`AMPL solver library`

(commercial), etc. - Sparse solver:
`ma27`

(freely available for noncommercial use),`mumps`

(open-source),`pardiso`

(commercial), etc.

These tools are often quite old, and they are not designed to run on GPUs. There have been recent attempts to solve these problems on GPUs, including by our project as well as hiop and HyKKT.

## What are the key features of ExaModels.jl and MadNLP.jl?

ExaModels.jl can perform automatic differentiation (for computing sparse first and second-order derivatives) on GPUs, while MadNLP.jl can apply the condensed-space interior point method to solve the problem on GPUs using external sparse linear solvers. By default, MadNLP uses cuDSS, NVIDIAâ€™s brand-new direct sparse solver, which has quite an impressive performance. With the combined capabilities of ExaModels, MadNLP, and cuDSS, one can solve nonlinear optimization problems purely on the GPUâ€”all the array data residents on device memory only, and with a majority of the operations benefiting from GPU acceleration, which can lead to quite significant performance improvements over CPUs.

## ExaModels and MadNLP API

The best way to try out these features is to model nonlinear optimization problems using ExaModelsâ€™ native API. For the details of the ExaModels API, please visit ExaModels documentation. Hereâ€™s a brief example of using ExaModels to create a nonlinear optimization model for GPUs:

```
using ExaModels, MadNLP, MadNLPGPU, CUDA
function luksan_vlcek_model(
N;
T = Float64, # precision
backend = nothing, # GPU backends, e.g., CUDABackend()
kwargs... # solver options
)
c = ExaCore(T; backend = backend)
x = variable(c, N; start = (mod(i, 2) == 1 ? -1.2 : 1.0 for i = 1:N))
constraint(
c,
3x[i+1]^3 + 2 * x[i+2] - 5 + sin(x[i+1] - x[i+2])sin(x[i+1] + x[i+2]) + 4x[i+1] -
x[i]exp(x[i] - x[i+1]) - 3 for i = 1:N-2
)
objective(c, 100 * (x[i-1]^2 - x[i])^2 + (x[i-1] - 1)^2 for i = 2:N)
return ExaModel(c; kwargs...)
end;
m = luksan_vlcek_model(10000; backend = CUDABackend())
result = madnlp(m; tol = 1e-8)
```

Note that unless the keyword argument `tol = 1e-8`

is given, `MadNLP`

will solve the problem up to `1e-4`

.

Optionally, one can use the JuMP interface as follows.

```
using ExaModels, MadNLP, MadNLPGPU, CUDA, JuMP
model = my_jump_model( ... )
set_optimizer(model, () -> ExaModels.MadNLPOptimizer(CUDABackend()))
optimize!(model)
```

Note that this comes with some performance compromise, and the model creation and AD performance will not be as performant as the native ExaModels API.

## Performance

In the following table, the performance of `MadNLP + ExaModels + cuDSS on GPU`

is compared with `Ipopt + ExaModels + Ma27`

. Note that ExaModels and Ma27 are state-of-the-art on single-threaded CPUs, as demonstrated by the rosetta-opf benchmark and a technical report comparing linear solver options for Ipopt. The tolerance for both configurations are set to `1e-8`

.

a =converged to â€śacceptableâ€ť tolerance (=1e-6)

i = infeasible

f = failed convergence

Weâ€™re observing quite an impressive performance of ExaModels + MadNLP +cuDSS, compared to the state-of-the-art alternatives running on CPUs. For example, for solving the largest case AC optimal power flow problem (case70k), MadNLP + ExaModels + cuDSS can solve the problem about 18 times faster than the state-of-the-art alternative on CPU (Ipopt + ExaModels + Ma27), with virtually no difference in terms of solution accuracy.

Still, thereâ€™s a caveat. The condensed-space IPM approach we adopted in MadNLP to ensure GPU compatibility causes an increased degree of ill-conditioning of the KKT system. This makes the convergence behavior less reliable than the full-space interior point method implemented in Ipopt. As a result, occasionally, some convergence issue is faced. Due to this limitation, we have lowered the default tolerance to `1e-4`

, though for a large number of problems weâ€™ve been considering, convergence is achieved with `tol=1e-8`

. Specifically, among 30 cases considered above, 4 cases encountered convergence issues. All the cases can be solved without an issue if we set the tolerance to `1e-6`

. We are actively working on developing strategies to stabilize the convergence behavior.

## Portability

Originally, we aimed to develop tools that can run on AMD/Intel GPUs. Weâ€™ve been using KernelAbstraction.jl extensively to support different types of GPU backends. Thanks to this, ExaModels has full compatibility with Intel and AMD GPUs. However, solving the NLPs fully on AMD/Intel GPU is not possible as of today, because as far as we know, there is no implementation of sparse LDL/Cholesky factorization on AMD/Intel GPUs that we can interface with MadNLP. We are interested in testing MadNLP with AMD/Intel GPUs in the future once a good sparse solver option becomes available on these platforms. If the community can develop some portable, GPU-accelerated version of `LDLFactorizations.jl`

, it would be greatly beneficial for us

## Closing Thoughts

As demonstrated in the case70k AC OPF example, GPUs provide exciting opportunities to significantly improve the scalability of large-scale nonlinear programming. Despite some hurdles, we believe that the progress and promise are undeniable.

If you are working on large-scale sparse nonlinear optimization (e.g., energy systems, optimal control PDE-constrained optimization, etc.) and looking for faster solution methods, please try using `ExaModels + MadNLP,`

and let us know what you think. We welcome inquiries and suggestions from the community.