hello everyone,
I am trying to use JuMP to minimize an objective function: essentially logistic regression with both LASSO and Ridge regularization terms (i.e. elastic net). Here is the setting:
I have a matrix of inputs, X, with n rows and m columns, so there are n training examples, each with m features. Correspondingly, I have a vector y of length n containing the true labels for the training examples.

Defining phi(i) = 1/(1 + exp(-Xi*theta)), where Xi is the i-th row of the matrix X, and theta is the vector of parameters, I have the following objective function to minimize over all theta:
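(The equation itself did not survive the post; the following is my reconstruction from the description above, assuming the standard penalized negative log-likelihood, with a and b as the L1 and L2 coefficients:)

L(\theta) = -\sum_{i=1}^{n} \Big[ y_i \log \phi(i) + (1 - y_i) \log\big(1 - \phi(i)\big) \Big] + a\,\lVert \theta \rVert_1 + b\,\lVert \theta \rVert_2^2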

I can optimize over the individual terms of L(theta), after converting minimization of norms to linear programs with constraints, but I cannot seem to come up with the syntax to write the whole of L(theta) in JuMP's @objective(model, Min, <some function>).

I would greatly appreciate any help as to how to write this in JuMP. I am open to any other optimization library / wrapper too if it is too complicated to do in JuMP.
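For concreteness, here is a plain numeric sketch of the objective described above (Python is used purely for illustration; the coefficients a and b are placeholders for the L1 and L2 penalty weights, and the L2 term is taken as squared, which is an assumption):

```python
import math

def phi(x_row, theta):
    # Logistic function applied to the dot product X_i * theta
    z = sum(x * t for x, t in zip(x_row, theta))
    return 1.0 / (1.0 + math.exp(-z))

def loss(X, y, theta, a, b):
    # Negative log-likelihood plus L1 (LASSO) and squared-L2 (Ridge) penalties.
    # a and b are placeholder regularization coefficients.
    nll = 0.0
    for x_row, yi in zip(X, y):
        p = phi(x_row, theta)
        nll -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return nll + a * l1 + b * l2
```

With theta = 0 the logistic term is 0.5 for every row, so the loss reduces to n * log(2) plus zero penalty, which is a handy sanity check.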

Thanks so much for pointing it out. I ran a quick Google search, hoping to avoid writing buggy code and simply use the available libraries. The code @odow very kindly wrote works to an extent, but it terminates with the error "Invalid number in NLP function or derivative detected". That means there is an infinity of some sort (most likely a divide by zero) somewhere, but it is hard to debug.

While a quick Google didn't return any example, would you have some sample code for the implementation of this problem (or something similar) using MLJLinearModels?

en = LogisticRegression(a, b; penalty=:en)
fit(en, X, y)

where a is the coefficient of the L1 penalty and b of the L2 penalty. These are unscaled (unlike sklearn), so you have to pay attention to that if you want to compare with sklearn (check the logistic tests in the code to see comparisons with sklearn if you want).

X, y are arrays of float

If you want something that works with data frames etc then use MLJ calling MLJLinearModels

Wow! That's quite simple. Is the penalty parameter being set to :en just an example (so I'd need to figure out what the penalty should be), or should I also set it to :en?

Well, you said elastic net? (:en is for that.) If you want just L2, write :l2; same for L1. Then it takes only one parameter, which is the coefficient for that penalty term.

By the way, in my experience Elastic Net usually sucks and L1 is usually what you want, but your mileage may vary.

I tried looking at the documentation for MLJLinearModels at its GitHub repository here but couldn't find much. Can you please point me to its documentation, so I can be more aware of it and use it better? Thanks.

What is your question? If you're looking for official docs, there's a stub I wrote, but it's probably not going to be of much help.

However, in your case my previous answer should be all you need: specify an :en penalty and two coefficients (one for the L1 and one for the L2) and then fit.

If an error is thrown, open an issue on the GitHub repo with an example where it caused you trouble and I'll help you there.

I see. Thanks.
The question was to be certain that the function I am trying to optimize is indeed the one optimized by this syntax. Second question: if I want to optimize different objectives, how do I set those up…

The elastic net penalty implemented in the package is

A * ||theta||_2 + B * ||theta||_1

where the first norm is the L2 norm and the second is the L1 norm, with no scaling.
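Taking that formula literally (plain norms as written here; the actual implementation may square the L2 term, so check the package source), a quick numeric sketch:

```python
import math

def elastic_net_penalty(theta, A, B):
    # A * ||theta||_2 + B * ||theta||_1, as described above (no scaling).
    l2 = math.sqrt(sum(t * t for t in theta))
    l1 = sum(abs(t) for t in theta)
    return A * l2 + B * l1
```

For example, for theta = [3, 4] the L2 norm is 5 and the L1 norm is 7, so A = 2, B = 1 gives a penalty of 17.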

For other objectives it's the same concept: there's a bunch of penalties you can use (Huber, Fair, etc.).

I know this is probably not as documented as you'd like, but if you look at the tests in the code there are examples for pretty much all the use cases it can handle. If you struggle with a specific one, ask on GitHub; I should be reasonably quick at answering.

Thanks a lot. The program finishes with the following warning: "Proximal GD did not converge in 1000 iterations". Is there a way to raise the maximum number of iterations?

Yes, but you can also broadly ignore these warnings (there's an open issue suggesting I should basically hide them).

Note that when you use L1 (which is part of elastic net), setting the scale right is important (this is what it's complaining about). If you call MLJLM from within MLJ, you get access to hyperparameter tuning to select these parameters in a principled way.

For instance, in scikit-learn the parameter is divided by 2n, where n is the number of rows in X; so if you set it to 1, the actual parameter passed is 5e-5. In MLJLM there is no automatic scaling, so you have to do this yourself.
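To illustrate the scaling conversion described above (a hypothetical helper, not part of either library; the 2n factor follows the description of scikit-learn's behavior):

```python
def sklearn_effective_param(param, n):
    # scikit-learn divides the penalty parameter by 2n (n = number of rows in X);
    # MLJLinearModels applies no such scaling, so to match scikit-learn you
    # would pass this effective value yourself.
    return param / (2 * n)
```

For example, with n = 10000 rows, a parameter of 1 becomes 5e-5, matching the figure quoted above.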

I see. Still, the actual values of the parameter vector I used to generate this data (values of y given X) were 0.5*ones(7), while the values returned after the algorithm finishes (1000 iterations) are
8-element Array{Float64,1}:
1.8316760931663576
1.933975804712437
1.8442303256269603
1.8576032361270116
1.9316497337449647
1.8108914710429065
1.990582963169875
10.63631447226224

So, ignoring the intercept, the values of the parameters are not so close to the original ones. Maybe running the algorithm for longer might get it to converge nearer to the actual values of the parameters?