MLJ: how to read/interpret `measure` from evaluate

In this toy example I know that one of the two configurations is overfitting:

tree1 = EvoTreeClassifier(; nrounds = 10, eta = 0.1, max_depth=5)

tree2 = EvoTreeClassifier(; nrounds = 10, eta = 0.1, max_depth=10)

because I checked it with a manual train/test split and fit:

# run once with tree = tree1 and once with tree = tree2
mach = machine(tree, X, y)
train, test = partition(eachindex(y), 0.7);
fit!(mach, rows=train);

# ROC curve on the held-out test rows (plotted with Makie)
y_pred = predict(mach, @view X[test, :]);
fprs, tprs, ts = roc_curve(y_pred, @view y[test]);
f = lines(fprs, tprs; label="test")

# ROC curve on the training rows
y_pred = predict(mach, @view X[train, :]);
fprs, tprs, ts = roc_curve(y_pred, @view y[train]);
lines!(fprs, tprs; label="train")

axislegend()

[Two plots: train vs. test ROC curves, one figure for tree1 and one for tree2.]


But I can’t tell this from looking at the results of evaluate(tree1) and evaluate(tree2):

evaluate(tree1, X, y,
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ measure                        β”‚ operation β”‚ measurement β”‚ 1.96*SE β”‚ per_fol β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LogLoss(                       β”‚ predict   β”‚ 0.509       β”‚ 0.00369 β”‚ [0.516, β‹―
β”‚   tol = 2.220446049250313e-16) β”‚           β”‚             β”‚         β”‚         β‹―
β”‚ AreaUnderCurve()               β”‚ predict   β”‚ 0.831       β”‚ 0.00429 β”‚ [0.823, β‹―
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────
res.per_fold
2-element Vector{Vector}:
 [0.5164007967507921, 0.503903753403266, 0.5087914373970631, 0.5097046862399631, 0.5067821829301071, 0.5107959181599303]
 Any[0.8225134347902088, 0.8373614101583923, 0.8336094740471997, 0.8313957836206489, 0.831747928950465, 0.8315899300438203]


evaluate(tree2, X, y,
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ measure                        β”‚ operation β”‚ measurement β”‚ 1.96*SE β”‚ per_fol β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LogLoss(                       β”‚ predict   β”‚ 0.484       β”‚ 0.0037  β”‚ [0.489, β‹―
β”‚   tol = 2.220446049250313e-16) β”‚           β”‚             β”‚         β”‚         β‹―
β”‚ AreaUnderCurve()               β”‚ predict   β”‚ 0.851       β”‚ 0.00314 β”‚ [0.845, β‹―
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────
res.per_fold
 [0.48938241998998006, 0.4799705129014878, 0.4845229947659094, 0.48114972544923074, 0.4808197206490011, 0.48903871416415157]
 Any[0.8452396760282181, 0.8533083034135965, 0.8519176930326134, 0.8538765775152372, 0.8520602768012421, 0.846907808652614]

I don’t know if my eyes just aren’t trained enough, but I can’t tell a β€œgood” value from a β€œbad” one, which I thought should be very simple given the difference in the ROC curves.

The same underlying issue causes problems for model hyperparameter tuning:

tree = EvoTreeClassifier()
rs = [
    range(tree, :eta, lower=0.001, origin=0.002, upper=0.1),
    range(tree, :nrounds, lower=10, origin=20, upper=250),
    range(tree, :min_weight, lower=0.5, origin=1.0, upper=3.5),
    range(tree, :colsample, lower=0.3, origin=0.8, upper=1.0),
    range(tree, :rowsample, lower=0.3, origin=0.8, upper=1.0),
    range(tree, :max_depth, lower=3, origin=4, upper=8),
];

tm = TunedModel(;
    model=tree,
    resampling = CV(nfolds=3),
    tuning=Grid(goal=15),
    range=rs, measure=[auc, log_loss]
    );

tunemach = machine(tm, X, y);

fit!(tunemach; rows=train)
β”Œ Info: Training machine(ProbabilisticTunedModel(model = EvoTreeClassifier(nrounds = 10, …), …), …).
β”” @ MLJBase /home/jiling/.julia/packages/MLJBase/6ooqv/src/machines.jl:496
β”Œ Info: Attempting to evaluate 64 models.
β”” @ MLJTuning /home/jiling/.julia/packages/MLJTuning/eSEHt/src/tuned_models.jl:727
Evaluating over 64 metamodels: 100%[=========================] Time: 0:05:12

fitted_params(tunemach).best_model

EvoTreeClassifier(
  nrounds = 250, 
  lambda = 0.0, 
  gamma = 0.0, 
  eta = 0.1, 
  max_depth = 8, 
  min_weight = 3.5, 
  rowsample = 1.0, 
  colsample = 1.0, 
  nbins = 32, 
  alpha = 0.5, 
  rng = Random.MersenneTwister(123, (0, 23413734, 23412732, 18)), 
  device = "cpu")
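
To see the out-of-sample scores that drove this choice, rather than only the winning hyperparameters, the tuning history can be inspected. A minimal sketch, assuming the field names exposed by MLJTuning’s report (they may differ slightly between versions):

rep = report(tunemach)
rep.best_history_entry   # the winning model together with its cross-validation scores
rep.history              # one entry per evaluated metamodel, with its scores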

Checking for overfitting this way can be misleading.

  1. For model selection, depending on the application domain, it might be less important to check for closeness of the training and test performance (see here). In these cases it is sufficient to compare the out-of-sample scores of the two models (see the sketch after this list). It seems plot 2 has a higher test AUC than plot 1 (I may be wrong; try to plot both on the same pane).
  2. It might just be that the differences in the plots above are due to the nature of the particular train and test datasets used.
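
To make the comparison in point 1 concrete, here is a minimal sketch (reusing tree1, tree2, X, y from above) that puts the out-of-sample scores of the two models side by side; measurement and per_fold are fields of the object returned by evaluate:

e1 = evaluate(tree1, X, y; resampling=CV(nfolds=6), measure=[log_loss, auc], verbosity=0)
e2 = evaluate(tree2, X, y; resampling=CV(nfolds=6), measure=[log_loss, auc], verbosity=0)

# aggregated out-of-sample scores, in the order the measures were given
e1.measurement   # [log_loss, auc] for tree1
e2.measurement   # [log_loss, auc] for tree2

# per-fold AUCs: if tree2 beats tree1 on every fold, the gap is bigger than fold-to-fold noise
e1.per_fold[2]
e2.per_fold[2]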

I would like to see the result of:

evaluate(tree1, X[train, :], y[train],
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
evaluate(tree2, X[train, :], y[train],
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)

Here’s the result:

evaluate(tree1, X[train, :], y[train],
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ measure                        β”‚ operation β”‚ measurement β”‚ 1.96*SE β”‚ per_fol β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LogLoss(                       β”‚ predict   β”‚ 0.511       β”‚ 0.00473 β”‚ [0.516, β‹―
β”‚   tol = 2.220446049250313e-16) β”‚           β”‚             β”‚         β”‚         β‹―
β”‚ AreaUnderCurve()               β”‚ predict   β”‚ 0.831       β”‚ 0.00608 β”‚ [0.823, β‹―
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────
res.per_fold
2-element Vector{Vector}:
 [0.5156692322207046, 0.5148959990683522, 0.5016058812139339, 0.5142813858720531, 0.508294783136379, 0.5092500658411819]
 Any[0.8230607740610442, 0.8246094193656129, 0.8414543292511101, 0.8274620221394415, 0.8349777148818776, 0.8316299858331504]

#########################################

evaluate(tree2, X[train, :], y[train],
        resampling=CV(; nfolds=6, shuffle=nothing),
        measure=[log_loss, auc],
        verbosity=1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ measure                        β”‚ operation β”‚ measurement β”‚ 1.96*SE β”‚ per_fol β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LogLoss(                       β”‚ predict   β”‚ 0.49        β”‚ 0.00448 β”‚ [0.493, β‹―
β”‚   tol = 2.220446049250313e-16) β”‚           β”‚             β”‚         β”‚         β‹―
β”‚ AreaUnderCurve()               β”‚ predict   β”‚ 0.844       β”‚ 0.00445 β”‚ [0.84,  β‹―
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────
res.per_fold
 [0.49313997400549425, 0.4960096759426803, 0.48619524416447407, 0.49555493107760795, 0.4851382554812855, 0.48597214007449996]
 Any[0.8401277524445331, 0.8384721512647232, 0.8486174757494099, 0.8400657842593326, 0.8494939978226504, 0.8480602057851294]

This is not going to work: the training set will be part of our final evaluation set.

Unless I do cross-training and evaluate each data point only with the model that hasn’t seen it? Is that legal?

What I’m suggesting is doing cross validation on the training dataset. The test dataset isn’t used here.

In the end, both the training dataset and the test set must be used.

Say I only have 10k data points and I have to use all 10k in the end. If I cross-validate on 9k and test on 1k, the gap between the 9k and the 1k still matters, because in the end 90% of the data will show artificially high performance.

To reinforce what I think @samuel_okon is suggesting, a typical setup involves three kinds of data: train, eval, and test.
Let’s assume that out of these 10k points, 2k are set aside as test data. On the remaining 8k observations you could then perform 4-fold CV: each of the 8k observations is assigned to one of the 4 CV buckets. For each of the 4 folds, a model is fitted on 6k observations until it starts overfitting on the 2k eval observations. The selected model is the one that consistently produces the best performance across the 4 folds. The final assessment is then done on the 2k test set that was initially set aside, after training the selected model on the full 8k training data.
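
A rough sketch of that protocol in MLJ, using the 10k/8k/2k sizes from the example and tree2 as a stand-in for the selected model (the early-stopping-on-the-eval-fold part is left out; only the split / CV / final-test structure is shown):

using MLJ, Statistics

# set aside 2k of the 10k rows as the final test set
trainval, test = partition(eachindex(y), 0.8; shuffle=true, rng=123)

# 4-fold CV on the remaining 8k rows: each candidate is scored on the
# 2k eval fold it never saw during fitting
e = evaluate(tree2, X[trainval, :], y[trainval];
        resampling=CV(nfolds=4), measure=log_loss, verbosity=0)

# final assessment: refit the selected model on all 8k rows and score it once
# on the 2k test rows that were set aside at the start
mach = machine(tree2, X[trainval, :], y[trainval])
fit!(mach)
ŷ = predict(mach, X[test, :])
mean(log_loss(ŷ, y[test]))   # mean, in case the measure returns per-observation losses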

It’s still possible to assess performance on the training data by looking at the metrics of each CV model on its eval data. However, this is exposed to some bias: typically the early-stopping criterion comes from the performance on that eval data, so there is some information leakage.

If you only want to find the best hyperparameters, you don’t need to split into training and test data, but you can use all the data to do cross-validation. If you want to find the best hyperparameters and compute an estimate of the test error, you can use e.g. nested cross-validation.
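
For example, a minimal sketch of nested cross-validation in MLJ: the TunedModel tm defined earlier already runs an inner CV(nfolds=3) to pick hyperparameters, so wrapping it in an outer evaluate call gives an estimate of the generalization error of the whole tuning procedure:

evaluate(tm, X, y;
        resampling=CV(nfolds=5),    # outer loop; the inner loop is tm's CV(nfolds=3)
        measure=[log_loss, auc],
        verbosity=0)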