I am currently trying to solve the following optimization problem

Where N is the number of observations on \mathbf{x} predictors, and y \in \{-1, 1\} is a binary vector of length N, \lambda has to be integer, C_0, C_1 are tuning parameters penalizing complexity of the model. Illustrative example: suppose you would like to predict if a mushroom is poisonous (y) and have predictors like its odor and it’s color (x_1, x_2). I would like to obtain a solution similar to this:

However, I am new to optimization problems (I have a statistical background) and thought I’d be a nice opportunity to get to know Julia better.

Now I am a bit lost - I understood how to generally formulate and solve MIP problems with JuMP. However, I don’t know how to optimize the 0-1 loss between a predicted category and the actual category. Also, I don’t know how to formulate the first penalty term C_0 ||\lambda||_0 where ||\lambda||_0 is one if lambda is nonzero, and 0 if \lambda is zero.

Any help, and also resources about such types of problems, would be greatly appreciated.

Pictures from

Ustun, B., Traca, S., & Rudin, C. (2013). Supersparse linear integer models for interpretable classification. *arXiv preprint arXiv:1306.6677* .