The difficult task of comparing supervised models

I often read blogs, or even academic articles, that compare different machine learning algorithms for a specific task, ending up with a ranking based, for example, on the lowest MSE or the highest accuracy.

These comparisons are done by just applying the algorithms or, at best, by using some cross-validation.

My “problem” is that, in my experience, a supervised model’s error/accuracy depends a lot on the random initialisation used for training, so the resulting “ranking” is itself a random variable, but I have never seen any hint of how reliable the ranking is.

I have several doubts/points that cause me trouble:

  • if we use the training set to fit the algorithm, the validation set for the hyperparameters (or even mix them with k-fold cross-validation), and the test set to judge the best algorithm with the best hyperparameters, isn’t there a risk of overfitting the test set?
  • you may say: get the variance of the algorithm from the cross-validation, but in reality the variance of the algorithm exists only conditional on a specific hyperparameter setting (e.g. for a random forest it will likely depend on the number of trees). Also, do the cross-validation functions in commonly used ML frameworks report the variance?
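On that last point: even when a framework only reports the mean score, the per-fold scores are straightforward to keep yourself. A toy sketch (everything here is illustrative: a trivial mean-only baseline stands in for a real model, and the data is synthetic) of running k-fold CV by hand and reporting the spread, not just the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 2.0 * X + rng.normal(scale=0.5, size=200)

def kfold_mse(y, k=5):
    """Return the MSE of a mean-only baseline on each of k folds."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        y_hat = y[train].mean()  # "fit" the placeholder model
        scores.append(np.mean((y[fold] - y_hat) ** 2))
    return np.array(scores)

scores = kfold_mse(y)
print("MSE per fold:", scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std(ddof=1):.3f}")
```

Note that, exactly as the bullet says, this spread is conditional on whatever hyperparameters the model was trained with; changing them changes both the mean and the variance.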

So, in the end, how can I provide a potential reader/customer with a more informative report on the quality of a given algorithm, where I can state (1) how confident I am in my ranking, and (2) what the expected variance of my algorithm’s error is (perhaps the algorithm with the lowest expected error is unfortunately also the most variable)?
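To make the “the ranking is a random variable” concern concrete, one toy approach is to repeat the train/test resampling many times and count how often model A actually beats model B. Everything below is a hypothetical sketch: the two “models” are stand-in baselines (mean predictor vs. median predictor) on synthetic heavy-tailed data; in practice you would swap in the real algorithms:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.standard_t(df=3, size=300) + 5  # heavy-tailed toy target

wins_a = 0
errs_a, errs_b = [], []
n_repeats = 200
for _ in range(n_repeats):
    idx = rng.permutation(len(y))
    tr, te = y[idx[:200]], y[idx[200:]]
    mse_a = np.mean((te - tr.mean()) ** 2)      # model A: mean predictor
    mse_b = np.mean((te - np.median(tr)) ** 2)  # model B: median predictor
    errs_a.append(mse_a)
    errs_b.append(mse_b)
    wins_a += mse_a < mse_b

print(f"A beats B in {wins_a}/{n_repeats} resamples")
print(f"A: mean={np.mean(errs_a):.2f}, std={np.std(errs_a):.2f}")
print(f"B: mean={np.mean(errs_b):.2f}, std={np.std(errs_b):.2f}")
```

The win fraction addresses point (1) — how sure you are of the ranking — and the per-model standard deviations address point (2).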


I like your questions about overfitting, and about generalization more broadly, and think they are really valid.

Wouldn’t the approach normally be to tune the hyperparameters on the training set and report the scores on the validation set? If you don’t do the last step multiple times, then information doesn’t flow back, so you’re not overfitting (simply put).
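That split discipline can be sketched in a few lines. This is a minimal illustration under made-up assumptions (polynomial degree as the hyperparameter, a synthetic sine target): hyperparameter search only ever sees the train and validation indices, and the test set is scored exactly once at the end:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 300)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=300)

idx = rng.permutation(300)
tr, va, te = idx[:180], idx[180:240], idx[240:]

def fit_eval(degree, train, evalset):
    """Fit a polynomial of the given degree and return MSE on evalset."""
    coefs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coefs, x[evalset])
    return np.mean((y[evalset] - pred) ** 2)

# Hyperparameter search touches only train + validation.
val_mse = {d: fit_eval(d, tr, va) for d in (1, 3, 5, 9)}
best_degree = min(val_mse, key=val_mse.get)

# One final evaluation on the test set, untouched until now.
test_mse = fit_eval(best_degree, tr, te)
print(f"chosen degree = {best_degree}, test MSE = {test_mse:.3f}")
```

Because the test score is computed once, after all tuning decisions are frozen, no information flows back from it into the model.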

Same reasoning as above. If you first fine-tune the model on the training set, then you can run k-fold cross-validation on the validation set without problems and report the variance of the cross-validation results.

But what if you want to update your model after you have used the validation set? I think the only solution is to get more data; you can then use it to validate the earlier modelling decisions.


I’m curious to see how others here respond.

In my experience, the robust way to do this is to use many benchmark problems and use those to generate statistics on performance across a range of relevant problems. It is also nice to have much more data than you need, so you can repeatedly subsample the data into training/validation sets to generate statistics on a particular problem.
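The “many benchmark problems” idea can be illustrated with a toy suite. This is a hedged sketch under invented assumptions: four synthetic regression problems that differ only in noise tail-heaviness, and two stand-in baselines (mean vs. median predictor) in place of real models. The point is summarising across the suite rather than trusting a single dataset:

```python
import numpy as np

rng = np.random.default_rng(7)

def make_problem(noise_df):
    """One synthetic target with heavier (low df) or lighter (high df) tails."""
    return rng.standard_t(df=noise_df, size=400) + 10

results = []
for df_ in (3, 5, 10, 30):  # four benchmark "problems"
    y = make_problem(df_)
    tr, te = y[:300], y[300:]
    mse_mean = np.mean((te - tr.mean()) ** 2)
    mse_median = np.mean((te - np.median(tr)) ** 2)
    results.append((mse_mean, mse_median))

results = np.array(results)
print("mean-predictor MSE per problem  :", results[:, 0].round(2))
print("median-predictor MSE per problem:", results[:, 1].round(2))
print("problems where median wins:", int((results[:, 1] < results[:, 0]).sum()))
```

With real models you would report the same kind of per-problem table plus an aggregate (win counts, average rank, or similar) so the reader sees how stable the ranking is across problems.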

But assessing the reliability of supervised models (especially online supervised models - digital twins, etc.) is an open research topic. The automotive industry (as well as other service and transport industries) is working on this, since it directly impacts their bottom line and brings up questions of liability.


I agree wholeheartedly with the point about having a “firewall” between your validation and testing data. Ideally the latter isn’t even seen until all models and hyperparams have been fully tuned on the validation set. Good public ML challenges do not expose a test set for this very reason. Academic or competition benchmarks do have the luxury of avoiding the “what if you want to update your model after you have used the validation set?” scenario @rikh describes, but computational resource availability can affect how much/what kind of cross-validation is performed. I’m not sure that’s strictly a bad thing, however, as it forces complex models like DNNs to be more resistant to hyperparam changes if they want to beat out a “cheaper” model given the same amount of compute time (assuming this is given and the experimental design is solid, which it often isn’t :smile:).