I want to train an ML model with MLJ, for example an XGBoostRegressor with L2 loss.
I split my dataset into three parts (train/val/test). During training I want to fit on the train set but also plot the loss/accuracy on the validation set, so that I can do some early stopping.
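For context, here is a minimal sketch of the setup I have in mind (the toy data from make_regression, the split fractions and num_round are just placeholders):

```julia
using MLJ
XGBoostRegressor = @load XGBoostRegressor pkg=XGBoost verbosity=0

# toy data and a three-way split of the row indices
X, y = make_regression(500, 5)
train, val, test = partition(eachindex(y), 0.6, 0.2; shuffle=true, rng=42)

model = XGBoostRegressor(num_round=100)   # default objective is squared (L2) error
mach = machine(model, X, y)
fit!(mach, rows=train)

# loss on the validation part, computed by hand
ŷ = predict(mach, rows=val)
rms(ŷ, y[val])
```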
I have been reading the docs for MLJ.evaluate!() and MLJ.learning_curve() but I do not understand how to do it.
Furthermore, how does Holdout() as a resampling strategy in evaluate!(…) work? I would be fine with splitting my dataset into train/test and then using Holdout to split off some of the training data for validation during the evaluation step. But the way I understand it, Holdout() in evaluate! simply removes data that is then neither used for training nor for loss evaluation.
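For concreteness, this is the kind of call I mean (the fraction_train value and restricting to my own train rows are just how I imagine it would be used):

```julia
# evaluate the machine with a Holdout strategy, restricted to my training rows;
# which rows the measure is actually computed on is exactly what I am unsure about
evaluate!(mach;
          rows       = train,
          resampling = Holdout(fraction_train=0.8, shuffle=true, rng=42),
          measure    = rms)
```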
When I look at the loss plots, it seems to me that the loss from learning_curve with Holdout() is computed on the data that was used for training, while with CV() it is computed on the held-out cross-validation folds? With CV() I can spot overtraining, i.e. the loss increasing again after a number of rounds; with Holdout() the loss decreases and then plateaus. I would have expected it to be the other way round, since with CV() all of the data will have been used for training at some stage.
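The two curves I am comparing come from calls like these (sweeping num_round is my choice, and the range bounds are arbitrary):

```julia
r = range(model, :num_round; lower=10, upper=500)

curve_holdout = learning_curve(mach;
                               range=r,
                               resampling=Holdout(fraction_train=0.8, rng=42),
                               measure=rms)

curve_cv = learning_curve(mach;
                          range=r,
                          resampling=CV(nfolds=5, rng=42),
                          measure=rms)

# e.g. with Plots.jl:
# plot(curve_cv.parameter_values, curve_cv.measurements,
#      xlab=curve_cv.parameter_name, ylab="rms")
```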
I can provide an MWE if you want, but this is more of a conceptual question than a bug report.
Edit:
Maybe my question boils down to: on which part of the data is the loss evaluated? If I do not specify any Holdout/CV, is it evaluated on the entire training dataset, i.e. the same data that is also used for updating the model's weights?
And if I do specify Holdout/CV, is the model fitted on only part of the training data, with the loss for the learning curve reported on the remaining holdout/CV part of the training data? (See the sketch below.)
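Spelled out as hypothetical hand-rolled code, is this the right mental model? (sub_train/sub_val are just my names for the split that Holdout would make internally.)

```julia
# (a) no Holdout/CV specified: fit and measure on the same rows?
fit!(mach, rows=train)
rms(predict(mach, rows=train), y[train])

# (b) Holdout(fraction_train=0.8): fit on 80% of `train`,
#     measure on the remaining 20%?
sub_train, sub_val = partition(train, 0.8; shuffle=true, rng=42)
fit!(mach, rows=sub_train)
rms(predict(mach, rows=sub_val), y[sub_val])
```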