MLJ: how to read/interpret `measure` from evaluate

In this toy example I know that one of the two configurations is overfitting:

tree1 = EvoTreeClassifier(; nrounds = 10, eta = 0.1, max_depth=5)

tree2 = EvoTreeClassifier(; nrounds = 10, eta = 0.1, max_depth=10)

because I checked it with a manual train/test split and fit:

# run once with tree = tree1 and once with tree = tree2
mach = machine(tree, X, y)
train, test = partition(eachindex(y), 0.7);
fit!(mach, rows=train);

# ROC curve on the held-out test rows (plotted with Makie)
y_pred = predict(mach, @view X[test, :]);
fprs, tprs, ts = roc_curve(y_pred, @view y[test]);
f = lines(fprs, tprs; label="test")

# ROC curve on the training rows
y_pred = predict(mach, @view X[train, :]);
fprs, tprs, ts = roc_curve(y_pred, @view y[train]);
lines!(fprs, tprs; label="train")

axislegend()

[Two plots: train vs. test ROC curves, one figure for tree1 and one for tree2.]


But I can’t tell this from looking at the results of evaluate(tree1) and evaluate(tree2):

evaluate(tree1, X, y,
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ measure                        β”‚ operation β”‚ measurement β”‚ 1.96*SE β”‚ per_fol β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LogLoss(                       β”‚ predict   β”‚ 0.509       β”‚ 0.00369 β”‚ [0.516, β‹―
β”‚   tol = 2.220446049250313e-16) β”‚           β”‚             β”‚         β”‚         β‹―
β”‚ AreaUnderCurve()               β”‚ predict   β”‚ 0.831       β”‚ 0.00429 β”‚ [0.823, β‹―
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────
res.per_fold
2-element Vector{Vector}:
 [0.5164007967507921, 0.503903753403266, 0.5087914373970631, 0.5097046862399631, 0.5067821829301071, 0.5107959181599303]
 Any[0.8225134347902088, 0.8373614101583923, 0.8336094740471997, 0.8313957836206489, 0.831747928950465, 0.8315899300438203]


evaluate(tree2, X, y,
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ measure                        β”‚ operation β”‚ measurement β”‚ 1.96*SE β”‚ per_fol β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LogLoss(                       β”‚ predict   β”‚ 0.484       β”‚ 0.0037  β”‚ [0.489, β‹―
β”‚   tol = 2.220446049250313e-16) β”‚           β”‚             β”‚         β”‚         β‹―
β”‚ AreaUnderCurve()               β”‚ predict   β”‚ 0.851       β”‚ 0.00314 β”‚ [0.845, β‹―
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────
res.per_fold
 [0.48938241998998006, 0.4799705129014878, 0.4845229947659094, 0.48114972544923074, 0.4808197206490011, 0.48903871416415157]
 Any[0.8452396760282181, 0.8533083034135965, 0.8519176930326134, 0.8538765775152372, 0.8520602768012421, 0.846907808652614]

I don’t know if my eyes just aren’t trained enough, but I can’t tell a β€œgood” value from a β€œbad” one, which I thought should be very simple given the difference in the ROC curves.

The same underlying issue causes problems for model hyperparameter tuning:

tree = EvoTreeClassifier()
rs = [
    range(tree, :eta, lower=0.001, origin=0.002, upper=0.1),
    range(tree, :nrounds, lower=10, origin=20, upper=250),
    range(tree, :min_weight, lower=0.5, origin=1.0, upper=3.5),
    range(tree, :colsample, lower=0.3, origin=0.8, upper=1.0),
    range(tree, :rowsample, lower=0.3, origin=0.8, upper=1.0),
    range(tree, :max_depth, lower=3, origin=4, upper=8),
];

tm = TunedModel(;
    model=tree,
    resampling = CV(nfolds=3),
    tuning=Grid(goal=15),
    range=rs, measure=[auc, log_loss]
    );

tunemach = machine(tm, X, y);

fit!(tunemach; rows=train)
β”Œ Info: Training machine(ProbabilisticTunedModel(model = EvoTreeClassifier(nrounds = 10, …), …), …).
β”” @ MLJBase /home/jiling/.julia/packages/MLJBase/6ooqv/src/machines.jl:496
β”Œ Info: Attempting to evaluate 64 models.
β”” @ MLJTuning /home/jiling/.julia/packages/MLJTuning/eSEHt/src/tuned_models.jl:727
Evaluating over 64 metamodels: 100%[=========================] Time: 0:05:12

fitted_params(tunemach).best_model

EvoTreeClassifier(
  nrounds = 250, 
  lambda = 0.0, 
  gamma = 0.0, 
  eta = 0.1, 
  max_depth = 8, 
  min_weight = 3.5, 
  rowsample = 1.0, 
  colsample = 1.0, 
  nbins = 32, 
  alpha = 0.5, 
  rng = Random.MersenneTwister(123, (0, 23413734, 23412732, 18)), 
  device = "cpu")
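
To see the out-of-sample scores that drove this choice, rather than only the winning hyperparameters, the tuning history can be inspected. A minimal sketch, assuming the field names exposed by MLJTuning’s report (they may differ slightly between versions):

rep = report(tunemach)
rep.best_history_entry   # the winning model together with its cross-validation scores
rep.history              # one entry per evaluated metamodel, with its scores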

Checking for overfitting this way can be misleading.

  1. For model selection, depending on the application domain, it might be less important to check for closeness of the training and test performance (see here). In these cases it is sufficient to compare the out-of-sample scores of the two models (see the sketch after this list). It seems plot 2 has a higher test AUC than plot 1 (I may be wrong; try to plot both on the same pane).
  2. It might just be that the differences in the plots above are due to the nature of the particular train and test datasets used.
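
To make the comparison in point 1 concrete, here is a minimal sketch (reusing tree1, tree2, X, y from above) that puts the out-of-sample scores of the two models side by side; measurement and per_fold are fields of the object returned by evaluate:

e1 = evaluate(tree1, X, y; resampling=CV(nfolds=6), measure=[log_loss, auc], verbosity=0)
e2 = evaluate(tree2, X, y; resampling=CV(nfolds=6), measure=[log_loss, auc], verbosity=0)

# aggregated out-of-sample scores, in the order the measures were given
e1.measurement   # [log_loss, auc] for tree1
e2.measurement   # [log_loss, auc] for tree2

# per-fold AUCs: if tree2 beats tree1 on every fold, the gap is bigger than fold-to-fold noise
e1.per_fold[2]
e2.per_fold[2]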

I would like to see the result of:

evaluate(tree1, X[train, :], y[train],
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
evaluate(tree2, X[train, :], y[train],
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)

Here’s the result:

evaluate(tree1, X[train, :], y[train],
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ measure                        β”‚ operation β”‚ measurement β”‚ 1.96*SE β”‚ per_fol β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LogLoss(                       β”‚ predict   β”‚ 0.511       β”‚ 0.00473 β”‚ [0.516, β‹―
β”‚   tol = 2.220446049250313e-16) β”‚           β”‚             β”‚         β”‚         β‹―
β”‚ AreaUnderCurve()               β”‚ predict   β”‚ 0.831       β”‚ 0.00608 β”‚ [0.823, β‹―
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────
res.per_fold
2-element Vector{Vector}:
 [0.5156692322207046, 0.5148959990683522, 0.5016058812139339, 0.5142813858720531, 0.508294783136379, 0.5092500658411819]
 Any[0.8230607740610442, 0.8246094193656129, 0.8414543292511101, 0.8274620221394415, 0.8349777148818776, 0.8316299858331504]

#########################################

evaluate(tree2, X[train, :], y[train],
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ measure                        β”‚ operation β”‚ measurement β”‚ 1.96*SE β”‚ per_fol β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LogLoss(                       β”‚ predict   β”‚ 0.49        β”‚ 0.00448 β”‚ [0.493, β‹―
β”‚   tol = 2.220446049250313e-16) β”‚           β”‚             β”‚         β”‚         β‹―
β”‚ AreaUnderCurve()               β”‚ predict   β”‚ 0.844       β”‚ 0.00445 β”‚ [0.84,  β‹―
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────
res.per_fold
 [0.49313997400549425, 0.4960096759426803, 0.48619524416447407, 0.49555493107760795, 0.4851382554812855, 0.48597214007449996]
 Any[0.8401277524445331, 0.8384721512647232, 0.8486174757494099, 0.8400657842593326, 0.8494939978226504, 0.8480602057851294]

This is not going to work: the training set will be part of our final evaluation set.

Unless I do cross-training and evaluate each data point only with the model that hasn’t seen it? Is that legal?

What I’m suggesting is doing cross validation on the training dataset. The test dataset isn’t used here.

In the end, both the training dataset and the test set must be used.

Say I only have 10k data points and I have to use all 10k in the end. If I cross-validate on 9k and test on 1k, the gap between the 9k and the 1k still matters, because in the end 90% of the data will show artificially high performance.

To reinforce what I think @samuel_okon is suggesting, a typical setup involves three kinds of data: train, eval, and test.
Let’s assume that out of these 10k points, 2k are set aside as test data. On the remaining 8k observations you could then perform 4-fold CV: each of the 8k observations is assigned to one of the 4 CV buckets. For each of the 4 folds, a model is fitted on 6k observations until it starts overfitting on the 2k eval observations. The selected model is the one that consistently produces the best performance across the 4 folds. The final assessment is then done on the 2k test set that was initially set aside, after training the selected model on the full 8k training data.
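
A rough sketch of that protocol in MLJ, using the 10k/8k/2k sizes from the example and tree2 as a stand-in for the selected model (the early-stopping-on-the-eval-fold part is left out; only the split / CV / final-test structure is shown):

using MLJ, Statistics

# set aside 2k of the 10k rows as the final test set
trainval, test = partition(eachindex(y), 0.8; shuffle=true, rng=123)

# 4-fold CV on the remaining 8k rows: each candidate is scored on the
# 2k eval fold it never saw during fitting
e = evaluate(tree2, X[trainval, :], y[trainval];
        resampling=CV(nfolds=4), measure=log_loss, verbosity=0)

# final assessment: refit the selected model on all 8k rows and score it once
# on the 2k test rows that were set aside at the start
mach = machine(tree2, X[trainval, :], y[trainval])
fit!(mach)
ŷ = predict(mach, X[test, :])
mean(log_loss(ŷ, y[test]))   # mean, in case the measure returns per-observation losses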

It’s still possible to assess performance on the training data by looking at the metrics of each CV model on its eval data. However, this is exposed to some bias: typically the early-stopping criterion comes from the performance on that eval data, so there is some information leakage.

If you only want to find the best hyperparameters, you don’t need to split into training and test data, but you can use all the data to do cross-validation. If you want to find the best hyperparameters and compute an estimate of the test error, you can use e.g. nested cross-validation.
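
For example, a minimal sketch of nested cross-validation in MLJ: the TunedModel tm defined earlier already runs an inner CV(nfolds=3) to pick hyperparameters, so wrapping it in an outer evaluate call gives an estimate of the generalization error of the whole tuning procedure:

evaluate(tm, X, y;
        resampling=CV(nfolds=5),    # outer loop; the inner loop is tm's CV(nfolds=3)
        measure=[log_loss, auc],
        verbosity=0)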