Nested optimizations, vars traverse nesting

Someone asked for my help with this, however it is outside my experience.

I am told the outer maximization is a quadratic in αi and βi and
the inner minimization is solvable by e.g gradient descent

maximize(
        mean(
                  minimize(
                       each Lᵢ by varying αᵢ, βᵢ (i in 1:N) 
                  )
         ) 
         by varying λₜ, ηₜ (t in 1:T) 
)

N people are observed at T times.
data and variables associated with a person 1:N use subscript **i**.
data and variables associated with a time 1:T use subscript **t** .

For *each person* 1:N there are
    two variables, each with N subscripts, to be fit or solved
        αᵢ βᵢ

For *each time* 1:T there are
    two sets of six known values (constant Float64s, available in array[s])
       c1ₜ c2ₜ c3ₜ c4ₜ c5ₜ c6ₜ
       d1ₜ d2ₜ d3ₜ d4ₜ d5ₜ d6ₜ
    two variables, each with T subscripts, to be fit or solved 
        λₜ  ηₜ

There are six functions, f1..f6 of the two 1:T-subscripted variables 

f1(λₜ, ηₜ) = c1ₜ * λₜ + d1ₜ * ηₜ
f2(λₜ, ηₜ) = c2ₜ * λₜ + d2ₜ * ηₜ
f3(λₜ, ηₜ) = c3ₜ * λₜ + d3ₜ * ηₜ
f4(λₜ, ηₜ) = c4ₜ * λₜ + d4ₜ * ηₜ
f5(λₜ, ηₜ) = c5ₜ * λₜ + d5ₜ * ηₜ
f6(λₜ, ηₜ) = c6ₜ * λₜ + d6ₜ * ηₜ
    
There are six wrapper functions g1..g6 wrapping f1..f6
    and taking as new arguments the two 1:N-subscripted variables

g1(λₜ, ηₜ, αᵢ, βᵢ) = f1(λₜ, ηₜ) * αᵢ^2  == (c1ₜ * λₜ + d1ₜ * ηₜ) *  αᵢ^2
g2(λₜ, ηₜ, αᵢ, βᵢ) = f2(λₜ, ηₜ)  * βᵢ^2 == (c2ₜ * λₜ + d2ₜ * ηₜ) *  βᵢ^2
g3(λₜ, ηₜ, αᵢ, βᵢ) = f3(λₜ, ηₜ)  * (αᵢ * βᵢ) == (c3ₜ * λₜ + d3ₜ * ηₜ) * (αᵢ * βᵢ)
g4(λₜ, ηₜ, αᵢ, βᵢ) = f4(λₜ, ηₜ)  * αᵢ == (c4ₜ * λₜ + d4ₜ * ηₜ) * αᵢ
g5(λₜ, ηₜ, αᵢ, βᵢ) = f5(λₜ, ηₜ)  * βᵢ == (c5ₜ * λₜ + d5ₜ * ηₜ) * βᵢ 
g6(λₜ, ηₜ, αᵢ, βᵢ) = f6(λₜ, ηₜ)  == (c6ₜ * λₜ + d6ₜ * ηₜ)

The g functions are combined into N sums (the args are elided for brevity)
     Lᵢ = L(λₜ, ηₜ, αᵢ, βᵢ) =
           g1(λₜ, ηₜ, αᵢ, βᵢ) + g2(λₜ, ηₜ, αᵢ, βᵢ) + g3(λₜ, ηₜ, αᵢ, βᵢ) +
           g4(λₜ, ηₜ, αᵢ, βᵢ) + g5(λₜ, ηₜ, αᵢ, βᵢ) + g6(λₜ, ηₜ, αᵢ, βᵢ)

each of the Lᵢ are minimized, so determining the αᵢ, βᵢ

The mean of the Lᵢ is determined
this mean is maximized, so determining the λₜ, ηₜ

maximize(
        mean(
                  minimize(
                       each Lᵢ by varying αᵢ, βᵢ (i in 1:N) 
                  )
         ) 
         by varying λₜ, ηₜ (t in 1:T) 
)
                    
1 Like

It would prob be easier to help w a much more minimal MWE

1° For each t, each i, the minimzation in \alpha,\beta is a quadratic program without constraints, and thus if you express it as min x'P*x - 2q'x + c the solution is x = P \ q if I remember correclty, which is thus given in closed form from your parameters, and the value of the objective function at the minimum is then -q'(P^(-1))*q +c. It will be a rational function in f1,...f6 due to the inverse of the 2x2 matrix P. You could derive it by hand with a paper and pen, as I just did.

2° Then, the outer maximization is therefore the maximization of a rational function. There are SumOfSquare solvers in JuMP that gives you global maximums, but if you dont care a non-linear solver will probably be enough.

My advice is as follow: express the value of the inner problem via Symbolics.jl to inverse the matrix P symboically in \lambda,\eta. It will give you an expression of the loss for the maximization problem, which you can try to analyse before giving it to a black box solver.

Thank you. This sounds like good advice. I will try that.

Good people, after much time and effort, this is the essence of our optimization problem (which belongs to graduate student in Economics). I have assisted as best I could. I asked her to revisit our Community’s kind expertise.

This is essentially a continuous minimax problem (technically “maximin”, but that’s just a sign flip), or more generally a bilevel optimization problem, and there are a bunch of algorithms in the literature.

If the inner function f(\cdot) is just some arbitrary (hopefully smooth) non-convex function, then presumably you are just hoping to do local optimization, probably with some variant of gradient descent? In that case you can use nested gradient-based optimization, with implicit differentiation to differentiate the \max (\cdots) outer objective with respect to the \alpha_t parameters.

(At some point, once things start getting sufficiently complicated, I would tend to implement the objective functions “manually”, maybe with some help from AD tools to get derivatives, and then call optimization packages directly, rather than trying to use a very high-level interface like JuMP.)

forwarding:
"f(_) is a quadratic polynomial in beta_i (and so has a global optimum). The outer problem has a global optimum as well; using gradient descent for it would be fine.

My question is literally how I would set it up using Julia optimization packages, especially given that the fact the inner minimization is solved separately for each i and then averaged (which I imagine introduces a slightly kink)."

if f(\beta) = a(\alpha, x_i) \beta^2 + 2b(\alpha, x_i) \beta + c(\alpha, x_i), with a > 0 (convex), then isn’t this trivial? The minimum is simply at \beta = -b/a, so \min_\beta f(\beta) = c - b^2/a. More generally, if \beta_i is a vector and you have a convex quadratic system \beta^T A \beta + 2\beta^T b + c with an SPD matrix A, then the minimum is c - b^T A^{-1} b. So, the problem reduces to:

\max_\alpha \underbrace{ \left(\frac{1}{N}\sum_i \left[ c_i(\alpha) - b_i(\alpha)^T A_i(\alpha)^{-1} b_i(\alpha) \right] \right) }_{g(\alpha)}

which can be straightforwardly maximized given AD-able functions A,b,c. (Here, I’m using \alpha to denote the vector of unknowns \{\alpha_t\}.)

Asking people to literally write your code for you is too much. You should make an attempt using the above hints, and ask a more specific question when you get stuck.

My advice would be

  1. Define the function g(\alpha) in Julia.
  2. Try to compute its gradient \nabla_\alpha g with an AD package like Zygote. If that fails, break it into simpler pieces and see which piece is failing, then ask about that. (Or just work out the gradient analytically — it’s not that hard if you know how to differentiate a matrix inverse.)
  3. Once you have g and \nabla_\alpha g, try one of the many gradient-based optimization packages in Julia. If you get stuck on that, ask a more specific question.
1 Like