For my PhD, I’m mainly dealing with data containing 10 to 20 features and sample sizes of about 100 to 500. According to a fair amount of the literature, cross-validation (CV) is biased and should be replaced by nested cross-validation whenever you can computationally afford it (Krstajic et al., 2014; Vabalas et al., 2019). I was thinking about writing a paper where I manually compare four models: something like a linear model, two Turing.jl models, and maybe a random forest.
So, as a sanity check: should I put the four models in a nested cross-validation loop to get automated answers to the following questions:
Which model performs best?
How well will the best model perform?
I expect that runtime will be okay. Compared to cross-validation, runtime is only multiplied by the number of models that I want to compare, and the outer loop can run in parallel.
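To make the shape of that computation concrete, here is a minimal plain-Julia sketch of the loop structure I have in mind. Note that trainmodel and testscore are hypothetical stubs standing in for a real fit/score interface (a higher score is assumed to be better); the MLJ-based answer below does this for real.

using Random, Statistics

# Hypothetical placeholders, not from any package: a real version would
# fit the model and compute an actual score.
trainmodel(model, Xtrain, ytrain) = model   # "fit": returns the model unchanged
testscore(fitted, Xtest, ytest) = rand()    # "score": returns a random number

# Nested CV: the inner folds pick a model; the outer folds estimate how
# well the whole selection procedure performs.
function nested_cv(models, X, y; outer_k=2, inner_k=3)
    n = length(y)
    outer = collect(Iterators.partition(shuffle(1:n), cld(n, outer_k)))
    map(outer) do test                      # outer loop; could run in parallel
        train = setdiff(1:n, test)
        inner = collect(Iterators.partition(shuffle(train), cld(length(train), inner_k)))
        # Inner CV score of one candidate model, using the training part only:
        avgscore(m) = mean(testscore(trainmodel(m, X[setdiff(train, f), :], y[setdiff(train, f)]),
                                     X[f, :], y[f]) for f in inner)
        best = argmax(avgscore, models)     # model selection (inner loop)
        refit = trainmodel(best, X[train, :], y[train])
        testscore(refit, X[test, :], y[test])   # honest outer-fold estimate
    end
end

Each outer fold repeats the entire inner model selection, which is where the extra cost over plain CV comes from, but the map over outer folds is embarrassingly parallel.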
Perhaps just whenever you use it to select a model? For merely validating a model, plain CV should be fine, I believe.
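For the plain-validation case, something like this MLJ snippet should do (a minimal sketch; the iris data is just a stand-in for real data):

using MLJ
X, y = @load_iris  # stand-in dataset bundled with MLJ
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
evaluate(Tree(), X, y, resampling=CV(nfolds=5, shuffle=true), measure=log_loss)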
Be careful with automated answers unless you’re okay with automated mistakes.
I would say this question only makes proper sense when stated as “Which model performs best on this very specific data set?”. You’ll find that any of the four models you’re considering may or may not perform best depending on which data set you choose. This is especially true if your paper is related to QSAR data sets like those in Krstajic et al. (2014).
Nested cross-validation was added to MLJ a while ago thanks to Anthony Blaom.
It can be done via a TunedModel (Tuning Models · MLJ). From the docs, to run 2 outer folds and 3 inner folds:
using MLJ  # also requires DecisionTree.jl and NearestNeighborModels.jl to be installed

X, y = @load_iris  # stand-in data; replace with your own X and y

tree = (@load DecisionTreeClassifier pkg=DecisionTree verbosity=0)()
knn = (@load KNNClassifier pkg=NearestNeighborModels verbosity=0)()
models = [tree, knn]

# For each model, run 3-fold cross-validation (inner loop).
multi_model = TunedModel(models=models,
                         resampling=CV(nfolds=3),
                         measure=log_loss,
                         check_measure=false)

# Run 2-fold cross-validation as the outer loop; each outer fold
# re-runs the inner model selection on its training data.
e = evaluate(multi_model, X, y,
             resampling=CV(nfolds=2),
             measure=log_loss,
             verbosity=6)

# Best model selected in the first outer fold.
e.report_per_fold[1].best_model
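The outer-fold scores are what answer the “how well will the best model perform?” question. Assuming the standard fields of MLJ’s PerformanceEvaluation object, they can be read off with:

e.measurement  # log_loss aggregated over the 2 outer folds
e.per_fold     # log_loss for each outer fold separately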
For more information, see also the nested cross-validation example in the scikit-learn documentation.