I often read blog posts, or even academic articles, that compare different machine learning algorithms on a specific task and end up with a ranking based, for example, on the lowest MSE or the highest accuracy.
These comparisons are done by just applying each algorithm once or, at best, by using some cross-validation.
My “problem” is that, in my experience, the error/accuracy of a given supervised model depends a lot on the random initialisation used during training, so the resulting “ranking” is itself a random variable; yet I have never seen any hint of how reliable such a ranking is.
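To make this concrete, here is a minimal sketch (pure NumPy; the toy one-hidden-layer net, the data, and all sizes are purely illustrative) showing how the test error of the very same model on the very same data varies with the random initialisation alone:

```python
import numpy as np

def train_mlp(seed, X_tr, y_tr, X_te, y_te, hidden=8, lr=0.05, epochs=200):
    """Train a tiny one-hidden-layer tanh net from a seed-dependent random
    init with plain gradient descent, and return its test MSE."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X_tr.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1))
    b2 = np.zeros(1)
    for _ in range(epochs):
        H = np.tanh(X_tr @ W1 + b1)           # forward pass
        err = (H @ W2 + b2) - y_tr            # prediction error
        # gradients of (1/2) * MSE
        gW2 = H.T @ err / len(X_tr)
        gb2 = err.mean(0)
        dH = (err @ W2.T) * (1 - H ** 2)      # backprop through tanh
        gW1 = X_tr.T @ dH / len(X_tr)
        gb1 = dH.mean(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    H = np.tanh(X_te @ W1 + b1)
    return float(np.mean(((H @ W2 + b2) - y_te) ** 2))

# Synthetic data, fixed across runs, so only the initialisation varies.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(200, 1))
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

scores = [train_mlp(s, X_tr, y_tr, X_te, y_te) for s in range(20)]
print(f"test MSE over 20 seeds: mean={np.mean(scores):.4f}  std={np.std(scores):.4f}")
```

A single reported number would hide the `std` entirely, which is exactly what most comparisons I read do.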
I have several doubts/points that trouble me:
- If we use the training set to fit the algorithm, the validation set to choose the hyperparameters (or even mix the two with k-fold cross-validation), and the test set to judge the best algorithm with its best hyperparameters, isn’t there a risk of overfitting the test set?
- You may say: get the variance of the algorithm from the cross-validation. But in reality the variance of an algorithm exists only conditional on a specific hyperparameter setting (e.g. for a random forest it will likely depend on the number of trees). Also, do the cross-validation functions in commonly used ML frameworks even report this variance?
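One construction that seems to address both points at once is nested cross-validation: the inner folds select the hyperparameter, the outer folds estimate the error of the *whole* tuning procedure (and give a per-fold spread), and each outer test fold is never used for selection. Below is a hand-rolled sketch; the closed-form ridge regression, the lambda grid, and all names are illustrative assumptions, not a recommendation:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; lam is the regularisation strength."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k disjoint folds."""
    return np.array_split(rng.permutation(n), k)

def nested_cv(X, y, lambdas, outer_k=5, inner_k=4, seed=0):
    """Outer folds estimate the generalisation error of the whole tuning
    procedure; inner folds pick the hyperparameter on outer-training data
    only, so the outer test fold is never touched during selection."""
    rng = np.random.default_rng(seed)
    outer_scores = []
    for test_idx in kfold_indices(len(X), outer_k, rng):
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        X_tr, y_tr = X[train_idx], y[train_idx]
        # inner CV: mean validation error per candidate lambda
        inner_means = []
        for lam in lambdas:
            errs = []
            for val_idx in kfold_indices(len(X_tr), inner_k, rng):
                fit_idx = np.setdiff1d(np.arange(len(X_tr)), val_idx)
                w = ridge_fit(X_tr[fit_idx], y_tr[fit_idx], lam)
                errs.append(mse(w, X_tr[val_idx], y_tr[val_idx]))
            inner_means.append(np.mean(errs))
        best_lam = lambdas[int(np.argmin(inner_means))]
        # refit on all outer-training data, score on the held-out outer fold
        w = ridge_fit(X_tr, y_tr, best_lam)
        outer_scores.append(mse(w, X[test_idx], y[test_idx]))
    return np.array(outer_scores)

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=120)
scores = nested_cv(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
print(f"outer-fold MSE: mean={scores.mean():.3f}  std={scores.std():.3f}")
```

Note that the outer `std` is the spread of the tuning-plus-fitting procedure as a whole, not of one fixed hyperparameter setting, which is closer to what I actually want to report.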
So, in the end, how can I give a potential reader/customer a more informative report on the quality of a given algorithm, in which I can state (1) how confident I am in my ranking and (2) what the expected variance of the error of my chosen algorithm is (perhaps the algorithm with the lowest expected error is, unfortunately, also the most variable)?
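Concretely, what I currently have in mind is something like the following sketch: repeat the whole train/evaluate cycle over many random splits and report, per algorithm, the mean and standard deviation of the error, plus the fraction of repetitions in which one algorithm beats the other. The two “algorithms” here (ridge vs. 3-NN on synthetic data) are purely illustrative placeholders:

```python
import numpy as np

def ridge_mse(X_tr, y_tr, X_te, y_te, lam=1.0):
    """Closed-form ridge regression, evaluated by test MSE."""
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]), X_tr.T @ y_tr)
    return float(np.mean((X_te @ w - y_te) ** 2))

def knn_mse(X_tr, y_tr, X_te, y_te, k=3):
    """k-nearest-neighbour regression, evaluated by test MSE."""
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return float(np.mean((y_tr[nn].mean(1) - y_te) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.5 * rng.normal(size=200)

results = {"ridge": [], "3-NN": []}
for rep in range(30):                       # repeated random splits
    perm = rng.permutation(200)
    tr, te = perm[:150], perm[150:]
    results["ridge"].append(ridge_mse(X[tr], y[tr], X[te], y[te]))
    results["3-NN"].append(knn_mse(X[tr], y[tr], X[te], y[te]))

for name, s in results.items():
    print(f"{name}: mean MSE={np.mean(s):.3f}  std={np.std(s):.3f}")
wins = np.mean(np.array(results["ridge"]) < np.array(results["3-NN"]))
print(f"fraction of splits where ridge beats 3-NN: {wins:.2f}")
```

Is reporting this kind of win fraction together with the per-algorithm spread a sound way to answer (1) and (2), or is there a more principled approach?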