Optimize 0-1 loss in MIP

I am currently trying to solve the following optimization problem (the SLIM formulation from the paper cited below):

\min_{\lambda} \; \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ y_i \, \lambda^T \mathbf{x}_i \le 0 \right] + C_0 \|\lambda\|_0 + C_1 \|\lambda\|_1

where N is the number of observations on the \mathbf{x} predictors, y \in \{-1, 1\}^N is a binary label vector of length N, \lambda is constrained to be a vector of integers, and C_0, C_1 are tuning parameters penalizing the complexity of the model. Illustrative example: suppose you would like to predict whether a mushroom is poisonous (y) from predictors like its odor and its color (x_1, x_2). I would like to obtain a solution in the form of a small integer scoring system, like the examples shown in the paper cited below.

However, I am new to optimization problems (I have a statistical background) and thought this would be a nice opportunity to get to know Julia better.
Now I am a bit lost: I understand how to formulate and solve MIP problems with JuMP in general, but I don't know how to optimize the 0-1 loss between the predicted category and the actual category. I also don't know how to formulate the first penalty term C_0 \|\lambda\|_0, where \|\lambda\|_0 counts the nonzero components of \lambda (each component contributes 1 if it is nonzero and 0 if it is zero).

Any help, and also resources about such types of problems, would be greatly appreciated.

Pictures from:
Ustun, B., Traca, S., & Rudin, C. (2013). Supersparse linear integer models for interpretable classification. arXiv preprint arXiv:1306.6677.

I believe the paper you mention actually provides a full MIP formulation.

In any case, to represent the "0-norm", you can introduce a binary variable a_j \in \{0, 1\} for every component j of \lambda and then minimize C_0 \sum_j a_j.
You also need constraints of the form -M a_j \le \lambda_j \le M a_j, where M is a suitable bound on the magnitude of the values \lambda can take; these force a_j = 1 whenever \lambda_j \ne 0.
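Here is a minimal JuMP sketch of that idea, assuming a feature matrix X (N × p), labels y ∈ {-1, 1}, the HiGHS solver, and illustrative values for the bound M and the penalties C_0, C_1 (all of these constants are my assumptions, not from the paper). The 0-1 loss is handled the same way, with one binary indicator z_i per observation and a big-M constraint; I use a margin of 1 so the all-zero classifier does not count every observation as correctly classified:

```julia
using JuMP, HiGHS
using Random

# Toy data for illustration; in practice X, y come from your dataset.
Random.seed!(1)
N, p = 100, 2
X = randn(N, p)
y = rand([-1, 1], N)

M = 10                                            # assumed bound on |λ_j|
C0, C1 = 0.01, 0.001                              # assumed penalty weights
Mloss = 1 + M * maximum(sum(abs.(X), dims = 2))   # big-M for the loss constraints

model = Model(HiGHS.Optimizer)
set_silent(model)

@variable(model, -M <= λ[1:p] <= M, Int)   # integer coefficients
@variable(model, a[1:p], Bin)              # a[j] = 1 iff λ[j] ≠ 0 (for the 0-norm)
@variable(model, b[1:p] >= 0)              # b[j] ≥ |λ[j]|        (for the 1-norm)
@variable(model, z[1:N], Bin)              # z[i] = 1 if observation i is misclassified

# Indicator constraints for the 0-norm: a nonzero λ[j] forces a[j] = 1.
@constraint(model, [j = 1:p], λ[j] <=  M * a[j])
@constraint(model, [j = 1:p], λ[j] >= -M * a[j])

# Linearize the absolute value for the 1-norm penalty.
@constraint(model, [j = 1:p], b[j] >=  λ[j])
@constraint(model, [j = 1:p], b[j] >= -λ[j])

# 0-1 loss: if y_i * λ'x_i < 1, the solver must set z[i] = 1.
@constraint(model, [i = 1:N], y[i] * (X[i, :]' * λ) >= 1 - Mloss * z[i])

@objective(model, Min, sum(z) / N + C0 * sum(a) + C1 * sum(b))

optimize!(model)
println("coefficients: ", value.(λ))
println("training error: ", sum(value.(z)) / N)
```

One practical note: the tighter you can make M (and hence Mloss) for your application, the better the LP relaxation and the faster the branch-and-bound search tends to be.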