Hello everyone,
I am trying to use JuMP to minimize an objective function: essentially logistic regression with both LASSO and Ridge regularization terms. Here is the setting:
I have a matrix of inputs, X, with n rows and m columns, so there are n training examples, each with m features. Correspondingly, I have a vector y of length n, which contains the true labels for the training examples.
Defining phi(i) = 1/(1 + exp(-Xi*theta)), where Xi is the i-th row of the matrix X and theta is the vector of parameters, I have the following objective function to minimize over all theta:

L(theta) = -sum_{i=1}^{n} [ y_i * log(phi(i)) + (1 - y_i) * log(1 - phi(i)) ] + a * ||theta||_1 + b * ||theta||_2^2

where a and b are the LASSO and Ridge penalty weights.
I can optimize over the individual terms of L(theta), after converting the norm minimizations into linear programs with constraints, but I cannot seem to come up with the syntax to write the whole of L(theta) in JuMP’s @objective(model, Min, <some function>).
I would greatly appreciate any help as to how to write this in JuMP. I am open to any other optimization library / wrapper too if it is too complicated to do in JuMP.
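For concreteness, here is a sketch of the kind of formulation I have in mind (my assumptions: Ipopt as the solver, synthetic placeholder data, and placeholder penalty weights a and b; the L1 norm goes through auxiliary variables t so the nonsmooth part stays out of the nonlinear objective):

```julia
using JuMP, Ipopt, Random

# Synthetic placeholder data (stand-ins for my real X and y).
Random.seed!(42)
n, m = 100, 5
X = randn(n, m)
y = Float64.(rand(n) .< 1 ./ (1 .+ exp.(-X * fill(0.5, m))))

a, b = 0.1, 0.1  # L1 and L2 penalty weights (placeholder values)

model = Model(Ipopt.Optimizer)
set_silent(model)
@variable(model, theta[1:m])
@variable(model, t[1:m] >= 0)                       # t[j] >= |theta[j]|
@constraint(model, [j = 1:m],  theta[j] <= t[j])
@constraint(model, [j = 1:m], -t[j] <= theta[j])

# z[i] = Xi * theta; the logistic loss is written in the equivalent form
# log(1 + exp(z)) - y*z, which avoids evaluating log of a quantity that
# can underflow to zero.
@NLexpression(model, z[i = 1:n], sum(X[i, j] * theta[j] for j = 1:m))
@NLobjective(model, Min,
    sum(log(1 + exp(z[i])) - y[i] * z[i] for i = 1:n) +
    a * sum(t[j] for j = 1:m) +
    b * sum(theta[j]^2 for j = 1:m))

optimize!(model)
theta_hat = value.(theta)
```

Writing the loss as log(1 + exp(z)) - y*z, rather than log(phi) and log(1 - phi) separately, sidesteps taking the log of something that can round to zero.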
Thanks so much for pointing it out. I did a quick Google search, hoping to avoid writing buggy code myself and simply use the available libraries. The code @odow very kindly wrote works to an extent, but it terminates with the error “Invalid number in NLP function or derivative detected”. That means an infinity or NaN is showing up somewhere (most likely a divide by zero), but it is hard to debug.
While a quick Google search didn’t return any examples, would you have some sample code for the implementation of this problem (or something similar) using MLJLinearModels?
using MLJLinearModels
en = LogisticRegression(a, b; penalty=:en)
fit(en, X, y)
where a is the coefficient of the L1 penalty and b of the L2 penalty. These are unscaled (unlike in sklearn), so you have to pay attention to that if you want to compare with sklearn (check the logistic tests in the code for comparisons with sklearn if you want).
X, y are arrays of float
If you want something that works with data frames etc., then use MLJ, calling MLJLinearModels through it.
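Putting it together, a minimal end-to-end sketch (synthetic data; the coefficients 0.1 are placeholders, and I'm encoding the labels as ±1, which is the convention the direct API expects if I remember correctly):

```julia
using MLJLinearModels, Random

# Synthetic data: n examples, m features.
Random.seed!(1)
n, m = 200, 5
X = randn(n, m)
# Labels encoded as +/-1 (assumed convention of the direct MLJLinearModels API).
y = ifelse.(rand(n) .< 1 ./ (1 .+ exp.(-X * fill(0.5, m))), 1.0, -1.0)

a, b = 0.1, 0.1                        # the two (unscaled) penalty coefficients
en = LogisticRegression(a, b; penalty=:en)
theta_hat = fit(en, X, y)              # fitted coefficients, intercept included by default
```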
Wow! That’s quite simple. Is setting the penalty parameter to :en just an example (so I’d need to figure out what the penalty should be), or should I also set it to :en?
Well, you said elastic net, and :en is for that. If you want just L2, write :l2; same for L1. In that case it takes only one parameter, which is the coefficient for that penalty term.
By the way, in my experience Elastic Net usually sucks and L1 is usually what you want, but your mileage may vary.
I tried looking at the documentation for MLJLinearModels at its GitHub repository here but couldn’t find much. Can you please point me to its documentation, so I can understand it better and use it properly? Thanks.
What is your question? If you’re looking for official docs, there’s a stub I wrote, but it’s probably not going to be of much help.
However, in your case my previous answer should be all you need: specify an :en penalty and two coefficients (one for the L1 and one for the L2) and then fit.
If an error is thrown, open an issue on the GitHub repo with an example where it caused you trouble and I’ll help you there
I see. Thanks.
The question was to be certain that the function I am trying to optimize is indeed the one optimized by this syntax. Second question: if I want to optimize different objectives, how do I set those up?
The elastic net penalty implemented in the package is
A * ||.||_2 + B * ||.||_1

where the first norm is L2 and the second is L1, with no scaling.
For other objectives it’s the same concept; there’s a bunch of penalties you can use (Huber, Fair, etc.).
I know this is probably not as documented as you’d like, but if you look at the tests in the code there are examples for pretty much all the use cases it can handle; if you struggle with a specific one, ask on GitHub and I should be reasonably quick at answering.
Thanks a lot. The program finishes with the following warning: “Proximal GD did not converge in 1000 iterations”. Is there a way to raise the maximum number of iterations?
Yes, but you can also broadly ignore these warnings (there’s an open issue saying I should basically hide them).
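If you do want to raise it, you can pass a solver explicitly when fitting. A rough sketch (I'm assuming here that the proximal-gradient solver type is `ProxGrad` and that it accepts a `max_iter` keyword; check the solver definitions in the source if that doesn't match your version):

```julia
using MLJLinearModels, Random

# Synthetic placeholder data.
Random.seed!(1)
X = randn(200, 5)
y = ifelse.(rand(200) .< 0.5, 1.0, -1.0)

en = LogisticRegression(0.1, 0.1; penalty=:en)
# Assumed API: raise the proximal-gradient iteration cap from its default.
theta_hat = fit(en, X, y; solver=ProxGrad(max_iter=10_000))
```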
Note that when you use L1 (which is part of elastic net), setting the scale right is important (this is what it’s complaining about). If you call MLJLM from within MLJ, then you have access to hyperparameter tuning to select these parameters in a principled way.
For instance, in scikit-learn the parameter is divided by 2n, where n is the number of rows in X; so with n = 10,000 rows, setting it to 1 means the parameter actually passed is 1/(2n) = 5e-5. In MLJLM there is no automatic scaling, so you have to do this yourself.
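As pure arithmetic, the conversion I mean is just this (assuming the 1/(2n) convention described above, with an example row count):

```julia
# scikit-learn divides the elastic-net penalty parameter by 2n (n = rows of X);
# MLJLinearModels applies no such scaling, so do the division yourself.
n = 10_000                            # example row count
alpha_sklearn = 1.0                   # value you would pass to sklearn
lambda_mljlm = alpha_sklearn / (2n)   # equivalent unscaled value: 5.0e-5
```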
I see. Still, the actual values of the parameter vector I used to generate this data (the values of y given X) were 0.5*ones(7), while the values returned after the algorithm finishes (1000 iterations) are:
8-element Array{Float64,1}:
1.8316760931663576
1.933975804712437
1.8442303256269603
1.8576032361270116
1.9316497337449647
1.8108914710429065
1.990582963169875
10.63631447226224
So, ignoring the intercept, the parameter values are not that close to the original ones. Maybe running the algorithm for longer would get it to converge nearer to the actual parameter values?