I want to train an ML model with MLJ, for example an XGBoostRegressor with L2 loss.
I split my dataset into three parts (train/val/test). During training I want to fit on the train set but also plot the loss/accuracy on the validation set, so that I can do some early stopping.
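For context, here is a minimal sketch of the setup I have in mind (the toy data from make_regression, the split fractions and num_round are just placeholders):

```julia
using MLJ
XGBoostRegressor = @load XGBoostRegressor pkg=XGBoost verbosity=0

# toy data and a three-way split of the row indices
X, y = make_regression(500, 5)
train, val, test = partition(eachindex(y), 0.6, 0.2; shuffle=true, rng=42)

model = XGBoostRegressor(num_round=100)   # default objective is squared (L2) error
mach = machine(model, X, y)
fit!(mach, rows=train)

# loss on the validation part, computed by hand
ŷ = predict(mach, rows=val)
rms(ŷ, y[val])
```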
I have been reading the docs for MLJ.evaluate!() and MLJ.learning_curve() but I do not understand how to do it.
Furthermore, how does Holdout() as a resampling strategy in evaluate!(…) work? I would be fine with splitting my dataset into train/test and then using Holdout to split off some of the training data for validation during the evaluation step. But the way I understand it, Holdout() in evaluate! simply removes data that is then neither used for training nor for loss evaluation.
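For concreteness, this is the kind of call I mean (the fraction_train value and restricting to my own train rows are just how I imagine it would be used):

```julia
# evaluate the machine with a Holdout strategy, restricted to my training rows;
# which rows the measure is actually computed on is exactly what I am unsure about
evaluate!(mach;
          rows       = train,
          resampling = Holdout(fraction_train=0.8, shuffle=true, rng=42),
          measure    = rms)
```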
When I look at the loss plots, it seems to me that the loss from learning_curve with Holdout() is computed on the data that was used for training, while with CV() it is computed on the held-out cross-validation folds? With CV() I can spot overtraining, i.e. the loss increasing again after a number of rounds; with Holdout() the loss decreases and then plateaus. I would have expected it to be the other way round, since with CV() all of the data will have been used for training at some stage.
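The two curves I am comparing come from calls like these (sweeping num_round is my choice, and the range bounds are arbitrary):

```julia
r = range(model, :num_round; lower=10, upper=500)

curve_holdout = learning_curve(mach;
                               range=r,
                               resampling=Holdout(fraction_train=0.8, rng=42),
                               measure=rms)

curve_cv = learning_curve(mach;
                          range=r,
                          resampling=CV(nfolds=5, rng=42),
                          measure=rms)

# e.g. with Plots.jl:
# plot(curve_cv.parameter_values, curve_cv.measurements,
#      xlab=curve_cv.parameter_name, ylab="rms")
```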
I can provide an MWE if you want, but this is more of a conceptual question than a bug report.
Edit:
Maybe my question boils down to: on which part of the data is the loss evaluated? If I do not specify any Holdout/CV, is it evaluated on the entire training dataset, i.e. the same data that is also used for updating the model's weights?
And if I do specify Holdout/CV, is the model fitted on only part of the training data, with the loss for the learning curve reported on the remaining holdout/CV part of the training data? (See the sketch below.)
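Spelled out as hypothetical hand-rolled code, is this the right mental model? (sub_train/sub_val are just my names for the split that Holdout would make internally.)

```julia
# (a) no Holdout/CV specified: fit and measure on the same rows?
fit!(mach, rows=train)
rms(predict(mach, rows=train), y[train])

# (b) Holdout(fraction_train=0.8): fit on 80% of `train`,
#     measure on the remaining 20%?
sub_train, sub_val = partition(train, 0.8; shuffle=true, rng=42)
fit!(mach, rows=sub_train)
rms(predict(mach, rows=sub_val), y[sub_val])
```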