Thanks for your help, Oscar!
Let me begin by describing the program we are solving. Given n-by-n positive semidefinite matrices A and W and a set of constrained off-diagonal entries \Omega \subseteq \{ (i,j) : 1 \leq i < j \leq n \}, we seek to solve the following:
\min_{\substack{X \succeq 0 \\ X(i,j) = -A(i,j) \ \forall (i,j) \in \Omega}} \operatorname{tr}(X W) + \gamma \cdot \| A + X \|_F^2
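For reference, a direct JuMP transcription of this program would look like the sketch below (direct_program is a hypothetical helper of mine, assuming A, W, Omega, and gamma are in scope; it is just to fix notation and is not tuned for performance). The MWE that follows instead builds X entrywise from scalar variables.

using JuMP, SCS, LinearAlgebra

function direct_program(A, W, Omega, gamma)
    n = size(A, 1)
    model = Model(SCS.Optimizer)
    # X is an n-by-n symmetric PSD matrix variable
    @variable(model, X[1:n, 1:n], PSD)
    # fix the constrained off-diagonal entries
    for (i, j) in Omega
        @constraint(model, X[i, j] == -A[i, j])
    end
    # tr(XW) plus the squared-Frobenius-norm penalty
    @objective(model, Min, tr(X * W) + gamma * sum((A .+ X) .^ 2))
    optimize!(model)
    return value.(X)
end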
Here’s a MWE which creates a random problem instance, builds a JuMP model, and passes it to a solver. My apologies that it is not so minimal. Our problem size of interest is n=2000, but here I used n=150 for speed.
using JuMP
using SCS, MosekTools, Clarabel
using LinearAlgebra
using StatsBase
using Random
using BenchmarkTools
function create_rand_psd(n)
    # random PSD matrix, normalized to unit trace
    Y = randn(n, n)
    A = Y * Y'
    return A / tr(A)
end
function create_constrained_entries(n)
    # sample a random 10% of the off-diagonal entries to be constrained
    all_off_diag = [(i, j) for i = 1:n, j = 1:n if i < j]
    num_constrained = ceil(Int, 0.1 * length(all_off_diag))
    return sample(all_off_diag, num_constrained, replace=false)
end
function mwe_program(A, W, constrained_entries, fro_penalty, solver)
    # get dimensions
    m = size(A, 1)
    @assert fro_penalty > 0
    # create the model
    model = Model(solver.Optimizer)
    # use a Set for O(1) membership tests inside the loop below
    ce_set = Set(constrained_entries)
    num_ce = length(ce_set)
    # one free scalar variable x[ind] per unconstrained upper-triangular entry
    num_vars = div(m * (m + 1), 2) - num_ce
    @variable(model, x[1:num_vars])
    # create the expression for the objective function
    obj_ex = zero(QuadExpr)
    # construct the symmetric affine JuMP expression X, filling in only the upper triangle
    X = zeros(AffExpr, m, m)
    ind = 1
    for i = 1:m
        for j = i:m
            if (i, j) in ce_set
                X[i, j] += -A[i, j]
            else
                # set the element in X
                X[i, j] += x[ind]
                # off-diagonal entries are counted twice in the symmetric matrix
                count_coeff = (i == j) ? 1.0 : 2.0
                # update the objective function
                if W[i, j] != 0.0
                    add_to_expression!(obj_ex, count_coeff * W[i, j], x[ind]) # linear part, tr(XW)
                end
                add_to_expression!(obj_ex, count_coeff * fro_penalty, (x[ind] + A[i, j])^2) # Frobenius penalty
                # update index
                ind += 1
            end
        end
    end
    # constraint that X is psd
    X = Symmetric(X)
    @constraint(model, X in PSDCone())
    @objective(model, Min, obj_ex)
    # run the optimization
    optimize!(model)
    # return the optimal matrix
    return value.(X)
end
# set seed
Random.seed!(123)
# create problem instance
println("\nCreating the problem instance...")
n = 150
A = create_rand_psd(n)
W = create_rand_psd(n)
constrained_entries = create_constrained_entries(n)
fro_penalty = 1.0
# run the instance -- 775.830 ms (833897 allocations: 63.13 MiB)
println("\nBenchmarking the JuMP program with SCS...")
@btime X = mwe_program(A, W, constrained_entries, fro_penalty, SCS)
# run the instance -- 45.985 s (934494 allocations: 72.71 MiB)
println("\nBenchmarking the JuMP program with Mosek...")
@btime X = mwe_program(A, W, constrained_entries, fro_penalty, Mosek)
# run the instance -- doesn't complete a first iteration within 2 minutes
println("\nBenchmarking the JuMP program with Mosek...")
@btime X = mwe_program(A, W, constrained_entries, fro_penalty, Clarabel)
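One reformulation I have seen suggested for conic solvers is to keep the objective linear and move the quadratic term into a rotated second-order cone via an epigraph variable. A sketch is below (epigraph_program is a hypothetical name of mine, using the same inputs as direct_program above; I have not benchmarked it at n=2000). It uses JuMP's rotated second-order cone, which models {(t, u, z) : 2tu >= ||z||^2}, so [t; 0.5; z] enforces t >= ||z||^2.

using JuMP, SCS, LinearAlgebra

function epigraph_program(A, W, Omega, gamma)
    n = size(A, 1)
    model = Model(SCS.Optimizer)
    @variable(model, X[1:n, 1:n], PSD)
    @variable(model, t >= 0)
    # fix the constrained off-diagonal entries
    for (i, j) in Omega
        @constraint(model, X[i, j] == -A[i, j])
    end
    # t >= ||A + X||_F^2 via the rotated second-order cone
    @constraint(model, [t; 0.5; vec(A .+ X)] in RotatedSecondOrderCone())
    @objective(model, Min, tr(X * W) + gamma * t)
    optimize!(model)
    return value.(X)
end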
Any help or advice in reformulating the model would be greatly appreciated!
Try using MosekTools.jl (commercial) or Clarabel.jl (open source) instead.
Thanks for the advice. Could you elaborate on why MosekTools.jl or Clarabel.jl would be preferable? I ran a quick benchmark on my laptop (in the MWE) that tests MosekTools.jl and Clarabel.jl against SCS.jl. Although it’s a much smaller problem size, SCS.jl seems to outperform both of them. In fact, Clarabel.jl does not complete its first iteration after a few minutes at n=150, though it does complete in 6 seconds at n=50. I haven’t yet run a test at a larger size, though.
Not really, and I don’t know if there would be much benefit. The algorithms are iterative, so they don’t parallelize very well.
Thanks for the advice here. It makes sense that parallelization may not be beneficial because the algorithms are iterative. But I would have guessed that the underlying linear system solves would be amenable to parallelization. Do you mind explaining a bit more why this isn’t the case?
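For instance, I would naively expect a rough check like the following to show a speedup from BLAS threads (this is my assumption, using a dense Cholesky factorization as a stand-in for the per-iteration solve; I realize the solvers' internal linear algebra may be sparse and linked against a different BLAS):

using LinearAlgebra

# stand-in for one iteration's linear solve: a dense Cholesky
# factorization of a well-conditioned positive definite matrix
n = 2000
Y = randn(n, n)
S = Symmetric(Y * Y' + n * I)
BLAS.set_num_threads(1)
@time cholesky(S)
BLAS.set_num_threads(4)
@time cholesky(S)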