Logistic Regression Problem


#1

I’m trying to use a logistic regression algorithm to find a classification model, but I get stuck with the error “failure to converge after 30 iterations”. I have changed the maxIter argument to a higher value, but the error only disappears at a very high number of iterations: in the small example below, with a population of 1000 elements and a very simple implicit model, I need 2000 iterations! Am I doing something wrong?

```julia
using DataFrames, GLM

df = DataFrame()
n = 1000
df.x = rand(n)
df.y = rand(n)
df.z = rand(n)
df.valid = map((x, y) -> (x^2 - y^6) > 0 ? true : false, df.x, df.y)
glm(@formula(valid ~ x + y + z), df, Binomial(), LogitLink())
```


#2

Since there is no noise added to the linear predictor, the model predicts the outcomes perfectly and, as a consequence, the MLE doesn’t exist: the likelihood function has no optimum but keeps growing as one or more of the model parameters diverge. This is known as complete (or quasi-complete) separation; see e.g. https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-complete-or-quasi-complete-separation-in-logisticprobit-regression-and-how-do-we-deal-with-them/. In ML, people usually add some regularization to ensure that an optimum exists, but GLM doesn’t add any.
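To see this, note that any noise in the outcomes is enough to make the fit converge. Here is a minimal sketch (the 5% flip rate is arbitrary) that flips a few labels in your example, after which glm() converges with the default settings:

```julia
using DataFrames, GLM, Random

Random.seed!(1)
n = 1000
df = DataFrame(x = rand(n), y = rand(n), z = rand(n))
df.valid = (df.x .^ 2 .- df.y .^ 6) .> 0   # perfectly separated labels
flip = rand(n) .< 0.05                     # pick ~5% of rows at random
df.valid = xor.(df.valid, flip)            # flip those labels: no more separation
glm(@formula(valid ~ x + y + z), df, Binomial(), LogitLink())
```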


#3

Thanks, that was a perfect answer!


#4

Is it planned functionality to add some simple regularization to GLM? If, as I’m assuming, fitting a GLM is essentially a Newton-Raphson algorithm, it should be easy to add an optional L2 cost parameter. It could be a big plus for usability: otherwise it can be confusing for a new user that, as soon as your problem is a bit ill-defined or unstable, you need to switch to (I guess) GLMnet or Lasso, which is a different package with different syntax, etc.
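For the record, here is a rough sketch of what that could look like; this is not GLM.jl code (ridge_logistic and lambda are made-up names), just IRLS solving the penalized normal equations (X'WX + λI)β = X'Wz instead of the usual X'WXβ = X'Wz:

```julia
using LinearAlgebra

# Hypothetical sketch, not GLM.jl API: logistic regression via IRLS with an
# optional L2 (ridge) penalty. The lambda*I term keeps X'WX invertible even
# under complete separation, so the iteration converges.
function ridge_logistic(X, y; lambda = 1e-2, maxiter = 100, tol = 1e-8)
    n, p = size(X)
    beta = zeros(p)
    for _ in 1:maxiter
        eta = X * beta
        mu = @. 1 / (1 + exp(-eta))        # logistic mean
        w = @. max(mu * (1 - mu), 1e-10)   # IRLS weights, floored for stability
        z = @. eta + (y - mu) / w          # working response
        beta_new = (X' * (w .* X) + lambda * I) \ (X' * (w .* z))
        norm(beta_new - beta) < tol && return beta_new
        beta = beta_new
    end
    return beta
end

# Hypothetical usage on the separated data from the first post:
# X = [ones(n) df.x df.y df.z]
# ridge_logistic(X, Float64.(df.valid))
```

With the penalty in place the update stays well conditioned even on perfectly separated data, which is exactly the case where the plain Newton-Raphson iteration diverges.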