Is it possible that numerical optimisation solve time decreases for some range of dimensions from low to moderate levels?

Hi,
I encountered a weird phenomenon of solve time calculation in numerical optimisation.

Backgrounds

There is a bunch of approximators, \{ f_j \}_{j \in J}. For example, f_1 is a feedforward (dense) neural network, and f_2 is an another approximator, …, f_{|J|} is an approximator that I proposed.

In my problem settings, I’m considering that approximators receive two arguments as f_j(x, u).
Each approximator is trained to approximate a given target function g in advance (the symbol for approximator parameters are omitted here, e.g., \theta in f^{\theta}_j(x, u)).

Then, I wanna compare the solve times to minimise each approximator f over the second argument while fixing the first argument, that is, for given x \in X,

\text{min}_{u \in U} f(x, u) \quad (1)

I found that the solve time decreases when the dimensions of x and u increase when minimising some approximators.
For example, suppose that X \subset \mathbb{R}^n, U \subset \mathbb{R}^m.
I calculated the mean solve time for given set \{ x_i \in X \}_{i \in I}.
Let t_i be the solve time of (1) for x_i.
Then, the mean solve time is \frac{1}{|I|} \sum t_i (I=500 for the below simulation result).

I expected that the mean solve time should increase as n and m increase, but it didn’t! Especially, n=1 and m=1 took a longer time than n=13 and m=4 (of course, the solve time for too big n and m, e.g., n=376, m=17, is longer than that of n=m=1).
In short, n=m=1 took longer time than similar case with moderate dimensions.

Questions

I wonder if it’s possible, and why it is.
I’ve suspected that

  1. 'cause each approximator is trained to approximate the target function g, it may influence the β€œcondition of the trained approximator for numerical optimisation”. For some reason, n=m=1 took extraordinarily longer time than others.
  2. For some reason, optimisation solvers (especially the convex optimisation solvers) take different strategies or tolerance, etc. for the scalar case (m=1). Or, they need some special cares that I’ve not figured out.
  3. Just my mistake; invalid comparison scripts due to the lack of my background for solve time comparison.

Please share your opinion :slight_smile:

For now, I share the original script for solve time calculation to gain some ideas to interpret this phenomenon and to have some time to properly deliver my situation to other users. To run this, include it and run main function with specified dimensions of n and m.
If necessary, I’ll write other simpler scripts as requested.

Notes

How to run?

  • Run as
include("test/basic.jl")
main(1, 1); main(13, 4); main(376, 17)  # to avoid JIT compilation time, I measured the solve time from the second run for each dimension pair

Results

As the below result, you might notice that for approximator LSE, PICNN, PMA, and PLSE, the solve time (denoted by optimise_time_mean) for n=m=1 is longer than that for n=13 and m=4.

  • (n, m) = (1, 1)
Row β”‚ n      m      epochs  approximator  optimise_time_mean  minimisers_diff_norm_mean  optvals_diff_abs_mean  no_of_minimiser_success  no_of_optval_success  number_of_parameters
     β”‚ Int64  Int64  Int64   String        Float64             Float64                    Float64                Int64                    Int64                 Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚     1      1     100  FNN                  0.000669363                 0.0339567              0.00835098                      500                   500                  4417
   2 β”‚     1      1     100  MA                   0.000423145                 0.0517864              0.131502                        500                   500                    90
   3 β”‚     1      1     100  LSE                  0.0377195                   0.00626931             0.126809                        500                   500                    90
   4 β”‚     1      1     100  PICNN                0.0104589                   0.0512426              0.100103                        500                   498                 25608
   5 β”‚     1      1     100  PMA                  0.00109487                  0.320571               0.0189876                       500                   500                  8188
   6 β”‚     1      1     100  PLSE                 0.00626701                  0.00967128             0.00171072                      500                   500                  8188
  • (n, m) = (13, 4)
 Row β”‚ n      m      epochs  approximator  optimise_time_mean  minimisers_diff_norm_mean  optvals_diff_abs_mean  no_of_minimiser_success  no_of_optval_success  number_of_parameters
     β”‚ Int64  Int64  Int64   String        Float64             Float64                    Float64                Int64                    Int64                 Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚    13      4     100  FNN                  0.01256                      0.994461              0.178648                        500                   500                  5377
   2 β”‚    13      4     100  MA                   0.000882126                  0.219979              0.0636181                       500                   500                   540
   3 β”‚    13      4     100  LSE                  0.0222627                    0.056159              0.0360104                       500                   500                   540
   4 β”‚    13      4     100  PICNN                0.00859685                   0.366883              0.0343947                       500                   500                 27987
   5 β”‚    13      4     100  PMA                  0.000694896                  0.498253              0.015404                        500                   500                 14806
   6 β”‚    13      4     100  PLSE                 0.00622943                   0.0568668             0.00995621                      500                   500                 14806
  • (n, m) = (376, 17)
 Row β”‚ n      m      epochs  approximator  optimise_time_mean  minimisers_diff_norm_mean  optvals_diff_abs_mean  no_of_minimiser_success  no_of_optval_success  number_of_parameters
     β”‚ Int64  Int64  Int64   String        Float64             Float64                    Float64                Int64                    Int64                 Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚   376     17     100  FNN                   0.0292232                     2.95216              0.165991                       500                   500                 29441
   2 β”‚   376     17     100  MA                    0.010935                      4.09744              0.125646                       500                   500                 11820
   3 β”‚   376     17     100  LSE                   0.243563                      4.01312              0.115364                       500                   500                 11820
   4 β”‚   376     17     100  PICNN                 0.0734538                     3.78249              0.131185                       500                   500                 84534
   5 β”‚   376     17     100  PMA                   0.0164681                     3.51627              0.0878748                      500                   500                 63388
   6 β”‚   376     17     100  PLSE                  0.00690463                    0.5204               0.0815478                      500                   500                 63388

Brief summary for each approximator

Approximator list

FNN: a standard feedforward neural network
MA: max-affine (MA) neural network [1]
LSE: log-sum-exp (LSE) neural network [1]
PMA: parametrised MA (proposed from my research)
PLSE: parametrised LSE (proposed from my research)

Notes

To calculate the solve time, FNN uses interior-point Newton method with box constraints from Optim.jl 'cause it’s non-convex wrt the second argument in general.
For other approximators, convex optimisation is used with SCS.jl and Convex.jl.
Note that MA and LSE are convex approximators and were not proposed for the two-argument setting, but it is obviously convex wrt the second argument as a projected convex function is also convex.

Refs

[1] G. C. Calafiore, S. Gaubert, and C. Possieri, β€œLog-Sum-Exp Neural Networks and Posynomial Models for Convex and Log-Log-Convex Data,” IEEE Transactions on Neural Networks and Learning Systems , vol. 31, no. 3, pp. 827–838, Mar. 2020.

2 Likes

I don’t think this is as mysterious as you seem to assume.

Assuming you are doing gradient-based optimization (as opposed to derivative-free optimization), there is no general simple relationship between the dimension n of the design variables (u) and the solve time. The number of optimization steps depends mainly on the function you are optimizing (e.g. how wiggly the steepest-descent path is), though the computational cost of each step does generally grow with n (often proportionally, \sim n).

For example, if you are minimizing the quadratic function f(u) = u^T A u - u^T b by steepest-descent iterations (for a symmetric positive-definite n \times n matrix A), then the number of iterations is proportional to the condition number of A. So, even though the cost of each iteration scales like \sim n^2 (from the matrix–vector multiply, or like \sim n if A is very sparse), you could easily devise a small-n problem that converges much more slowly than a large-n problem.

(Even for derivative-free optimization, badly conditioned low-dimensional problems could conceivably converge more slowly than well-conditioned high-dimensional problems. But for derivative-free methods there is a scaling proportional to the dimension n in the number of optimization steps, because without access to the gradient an optimization algorithm must necessarily explore what each optimization variable is doing using \sim n steps.)

4 Likes

Thank you so much, @stevengj! It’s really helpful :slight_smile:

To clarify my situation, the target function with each dimension in my settings is similar as
g(x, u) = -x^{T}x + u^{T}u for each (n, m) pair where X =[-1, 1]^n \subset \mathbb{R}^n and U =[-1, 1]^m \subset \mathbb{R}^m.

Then, do you think the following summary is reasonable?

  1. As observed in your quadratic function minimisation problem, the solve time depends not only on the dimension (due to the matrix-vector multiplication) but also on β€œthe condition” for given problem (for the quadratic function case, the condition refers to the condition number of matrix A).
    That is, even for the same type of approximator, the approximators for each dimension may not have a clear tendency with respect to the dimension.
  2. And the above thought can be more persuasive if the solver (SCS) is a gradient-based method rather than it is a gradient-free method.

And as you said the solve time depends on β€œthe condition” and the dimension, I guess that even the target function g is similar for each dimension the condition of β€œeach trained approximator” may be different from each dimension:

As a result, the solve time issue that I raised seems not that mysterious and difficult to determine certain tendencies as it is hard to be guaranteed that the optimisation problem’s condition is similar for each dimension as each approximator is trained to approximate the target function before the numerical optimisation, which definitely depends on the dimension, the type of approximator, the number of hyperparameters, the number of training data, and so on.

This is equivalent to A=I in my example, which has condition number equal to 1; some algorithms (e.g. steepest-descent) would actually converge in a single step regardless of the dimension. (Moreover, the minimum is attained for u=0, but I guess this is a toy case for your real problem?)

3 Likes

Oh, I’m sorry. It should be g(x, u) = -\frac{x^Tx}{n} + \frac{u^Tu}{m}. I’m really sorry for providing wrong information.

In this case, is the condition number 1/m? (and it should be pointed out that the optimisation is not performed with g, but f_j; because of it, each approximator shows a ddifferent result. For example, some approximators show high errors of minimiser (u=0) and minimum value (-x^T x). Nonetheless, for clearer comparison, it would be better to adjust the target function to have the same condition number for each dimension like g(x, u) = -x^Tx + u^Tu.)

Then, it seems to make sense to me, well (the solve time things)…

Note: It is a toy example but there is no other problem. The aim of the current research is β€œhow well the proposed approximators approximate given functions in a certain class” (in theory) and β€œwhether it is suitable for optimisation (with fixed the first argument), e.g., short solve times”. To see this, I’m testing with the toy example.

Nope. An overall scale factor doesn’t change the condition number.

See e.g. Trefethen & Bau, Numerical Linear Algebra.

2 Likes

Thank you. I’ll read it. Do you refer to "relative condition number \kappa"?
And did you refer to the condition number for square and invertible matrix A as follows [1]?

Is the fact mentioned by you that iteration steps are proportional to the condition number of A only valid for quadratic programming (QP)? If so, it is hard to find such relations as each approximator does probably not form a QP. Otherwise, is there Julia package or function to get a conditional number of a given function?

And I would appreciate if you share some texts that explain the relation between β€œconditions” and iteration number for various (especially convex) optimisation problems, for example, QP, piecewise linear function minimisation, etc.

Refs

[1] L. N. Trefethen and D. Bau, Numerical linear algebra . Philadelphia: Society for Industrial and Applied Mathematics, 1997.

@stevengj BTW, you’re the author of PyCall.jl!
Thank you for the very useful package :slight_smile:

The general definition of the condition number that you refer to does not specify which norm is used. If 2-norm is considered, which is the common case, the condition number is a ratio of the largest and the smallest singular values. For symmetric real matrices it is given as a ratio of the absolute values of the largest and the smallest eigenvalues.

julia> using LinearAlgebra

julia> A = [1.0 0; 0 1000.0]
2Γ—2 Matrix{Float64}:
 1.0     0.0
 0.0  1000.0

julia> cond(A)
1000.0

General nonlinear functions (well, not completely general but those for which at least the first and second derivatives exist) are locally characterized by the first and second derivatives. For scalar functions of vector arguments the second derivatives form a matrix – Hessian matrix. This matrix determines how the function is curved locally. If this matrix is ill-conditioned, it means that the function is (locally) curved very β€œwildly”, which makes the basic steepest descent method(s) converge slowly. You can now examine the conditioning of the Hessian matrix of some notoriosly difficult functions such as Rosenbrock function. Go ahead, compute its Hessian and check how ill-conditioned it is. This will indicate that indeed the function constitutes a challenge for (some) algorithms.

2 Likes

Thank you, @zdenek_hurak !

Yeah, I understood :slight_smile: I’ll try it.

One more question: As you said, one can see the condition number of Hessian to examine the local condition of a given function. In this case, at which point do I need to compute the Hessian? At the approximate local minimiser in interest? (β€œapproximate” in the case of that one cannot easily access the true minimiser)

@zdenek_hurak @stevengj
For now, I computed Hessian at each minimiser. More precisely, for each dimension pair (n, m), I randomly sampled some testing points \{ x_i \in X =[-1, 1]^n \}_{i \in I}. Then, for an approximator f_j (in this case, LSE), I computed Hessians via ForwardDiff as
H_i = H(f_j(x_i, u)) |_{u = \hat{u}^{*}_i}
where \hat{u}^{*}_i denotes the minimiser obtained from numerical optimisation for given x_i.

Then, for all x_i's, I computed the mean of condition number of Hessians H_i as
\frac{1}{|I|} \sum \text{cond}(H_i).

As shown below, the mean condition number is 1 and 1.31 for (n, m) = (1, 1), (13, 4), respectively. Mean solve times are 0.05 and 0.02 s, resp.

So… as I’ve read your comments, it’s an unexpected result, right?

I’m gonna try other things, e.g., adjust dimensions, training epochs, and training data.

Please give me any advice if you have any ideas.

Thank you as always :slight_smile:

Note: source code

EDIT: for sufficiently big m = 17, the order of condition number is changed (from ~1 to ~10), so it seems to make a difference.
But for small m = 1 \sim 4, well… I cannot figure out the tendency by now.

julia> main(1, 1)
df = 1Γ—11 DataFrame
 Row β”‚ n      m      epochs  approximator  optimise_time_mean  minimisers_diff_norm_mean  optvals_diff_abs_mean  no_of_minimiser_success  no_of_optval_success  number_of_parameters  cond_hessian_mean
     β”‚ Int64  Int64  Int64   String        Float64             Float64                    Float64                Int64                    Int64                 Int64                 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚     1      1     100  LSE                    0.0509796                 0.00300408               0.129234                      500                   500                    90                1.0

julia> main(1, 2)
df = 1Γ—11 DataFrame
 Row β”‚ n      m      epochs  approximator  optimise_time_mean  minimisers_diff_norm_mean  optvals_diff_abs_mean  no_of_minimiser_success  no_of_optval_success  number_of_parameters  cond_hessian_mean
     β”‚ Int64  Int64  Int64   String        Float64             Float64                    Float64                Int64                    Int64                 Int64                 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚     1      2     100  LSE                    0.0457618                   0.103397               0.128137                      500                   500                   120            1.42656
1Γ—11 DataFrame
 Row β”‚ n      m      epochs  approximator  optimise_time_mean  minimisers_diff_norm_mean  optvals_diff_abs_mean  no_of_minimiser_success  no_of_optval_success  number_of_parameters  cond_hessian_mean
     β”‚ Int64  Int64  Int64   String        Float64             Float64                    Float64                Int64                    Int64                 Int64                 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚     1      2     100  LSE                    0.0457618                   0.103397               0.128137                      500                   500                   120            1.42656

julia> main(13, 4)
df = 1Γ—11 DataFrame
 Row β”‚ n      m      epochs  approximator  optimise_time_mean  minimisers_diff_norm_mean  optvals_diff_abs_mean  no_of_minimiser_success  no_of_optval_success  number_of_parameters  cond_hessian_mean
     β”‚ Int64  Int64  Int64   String        Float64             Float64                    Float64                Int64                    Int64                 Int64                 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚    13      4     100  LSE                    0.0239312                  0.0714554              0.0356601                      500                   500                   540            1.31084
df = 1Γ—11 DataFrame
 Row β”‚ n      m      epochs  approximator  optimise_time_mean  minimisers_diff_norm_mean  optvals_diff_abs_mean  no_of_minimiser_success  no_of_optval_success  number_of_parameters  cond_hessian_mean
     β”‚ Int64  Int64  Int64   String        Float64             Float64                    Float64                Int64                    Int64                 Int64                 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚     1     17     100  LSE                    0.0729302                    1.43044               0.142975                      500                   500                   570             160.24
julia> main(1, 17)
1Γ—11 DataFrame
 Row β”‚ n      m      epochs  approximator  optimise_time_mean  minimisers_diff_norm_mean  optvals_diff_abs_mean  no_of_minimiser_success  no_of_optval_success  number_of_parameters  cond_hessian_mean
     β”‚ Int64  Int64  Int64   String        Float64             Float64                    Float64                Int64                    Int64                 Int64                 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚     1     17     100  LSE                    0.0729302                    1.43044               0.142975                      500                   500                   570             160.24

I’ve tried many things but I don’t still get a clear explanation.

So I saw the solver’s status as follows.
The average solve time for lin-sys of (n, m) = (1, 1) is smaller than that of (n, m) = (13, 4), but it took more time (on average) for the projection, which seems dominant.

I’m not familiar with the details of convex optimisation solver, so I don’t know why there is such differences.

EDIT:
IMO, even for relatively higher-dimensional cases, it may be possible to take a smaller total solve time if the order of dimensions is similar. Especially for simple problems, the hyperparameters of the solver (e.g., tolerance) would affect the solve time.
It seems beyond the scope of my research… :frowning:

  • (n, m) = (1, 1)
----------------------------------------------------------------------------
        SCS v2.1.2 - Splitting Conic Solver
        (c) Brendan O'Donoghue, Stanford University, 2012
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 124
eps = 1.00e-05, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00
acceleration_lookback = 10, rho_x = 1.00e-03
Variables n = 33, constraints m = 94
Cones:  primal zero / dual free vars: 1
        linear vars: 3
        exp vars: 90, dual exp vars: 0
Setup time: 1.37e-04s
----------------------------------------------------------------------------
 Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
     0| 1.01e+20  1.23e+20  4.53e-01 -3.29e+21 -1.24e+21  2.86e+20  3.02e-04
    60| 5.05e-09  4.20e-10  1.69e-08 -1.77e-01 -1.77e-01  2.53e-15  5.22e-02
----------------------------------------------------------------------------
Status: Solved
Timing: Solve time: 5.22e-02s
        Lin-sys: nnz in L factor: 253, avg solve time: 8.70e-07s
        Cones: avg projection time: 8.44e-04s
        Acceleration: avg step time: 9.10e-06s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 4.4832e-15, dist(y, K*) = 1.5087e-08, s'y/|s||y| = -2.4959e-10
primal res: |Ax + s - b|_2 / (1 + |b|_2) = 5.0488e-09
dual res:   |A'y + c|_2 / (1 + |c|_2) = 4.2031e-10
rel gap:    |c'x + b'y| / (1 + |c'x| + |b'y|) = 1.6866e-08
----------------------------------------------------------------------------
c'x = -0.1772, -b'y = -0.1772
============================================================================

  • (n, m) = (13, 4)
----------------------------------------------------------------------------
        SCS v2.1.2 - Splitting Conic Solver
        (c) Brendan O'Donoghue, Stanford University, 2012
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 220
eps = 1.00e-05, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00
acceleration_lookback = 10, rho_x = 1.00e-03
Variables n = 36, constraints m = 100
Cones:  primal zero / dual free vars: 1
        linear vars: 9
        exp vars: 90, dual exp vars: 0
Setup time: 1.90e-04s
----------------------------------------------------------------------------
 Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
     0| 8.60e+19  1.07e+20  3.57e-01 -1.92e+21 -9.10e+20  1.21e+21  3.14e-04
    40| 4.48e-09  3.44e-08  1.84e-08 -1.80e-01 -1.80e-01  2.17e-15  2.16e-02
----------------------------------------------------------------------------
Status: Solved
Timing: Solve time: 2.16e-02s
        Lin-sys: nnz in L factor: 376, avg solve time: 1.04e-06s
        Cones: avg projection time: 5.15e-04s
        Acceleration: avg step time: 9.45e-06s
----------------------------------------------------------------------------
Error metrics:
dist(s, K) = 2.5144e-15, dist(y, K*) = 1.0077e-07, s'y/|s||y| = -6.9238e-11
primal res: |Ax + s - b|_2 / (1 + |b|_2) = 4.4807e-09
dual res:   |A'y + c|_2 / (1 + |c|_2) = 3.4417e-08
rel gap:    |c'x + b'y| / (1 + |c'x| + |b'y|) = 1.8397e-08
----------------------------------------------------------------------------
c'x = -0.1801, -b'y = -0.1801
============================================================================