MLJ/MljFlux standardisation of variables in each cross-validation fold

Hi all,

I am learning Julia, and so far I have implemented a few Bayesian models using Turing and other packages. I found it quite fast and reliable.

I am now building a neural network model for a land-use/land-cover task, and I want to ensure that my six predictive variables are standardised within each stratified fold to avoid data leakage. In other words, I want to standardise the data independently within each fold, learning the scaling parameters (means and standard deviations) on the training portion and applying them to that fold's holdout portion. I have used the `Pipeline` construct, but I am unsure whether it is doing what I want:

##### Independent covariates
##### Select columns with raw values of 6 covariates

ind_covs = DataFrames.select(lulc, 4, 5, 6, 7, 8, 9)

ind_covs = DataFrames.rename(ind_covs, [:ndvi, :tree_cover, :ntree_cover, :elev, :slope, :pop])

nrow(ind_covs)

print(describe(ind_covs, :all))

##### Neural network classifier
NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg=MLJFlux

##### Define the neural network classifier
lulc_classifier = NeuralNetworkClassifier(builder = MLJFlux.MLP(; hidden=(6,6), σ=Flux.relu),
            epochs=100, loss = Flux.Losses.crossentropy)

##### Standardise (is this correct?)
##### Is this doing what I want it to do?
stdz_classifier = Pipeline(Standardizer(), lulc_classifier)

##### Neural net
nnet_lulc = machine(stdz_classifier, ind_covs, land_use)

##### Evaluate the predictive performance of the nnet
mod_perf = evaluate!(nnet_lulc, resampling = StratifiedCV(; nfolds=10), 
                               measure=[balanced_accuracy, cross_entropy])

Any tips and suggestions will be more than welcome.

Thanks!

Pablo

Yes, this does what you think it does: the standardisation and supervised learning use only the data from each training fold.
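If you want to see this per-fold behaviour directly, you can mimic what `evaluate!` does for a single fold by fitting the machine on a restricted set of rows and inspecting the learned parameters. A sketch, assuming the objects from your post (`stdz_classifier`, `ind_covs`, `land_use`) are already in scope:

```julia
# Sketch: fit on "training fold" rows only, as evaluate! does internally.
train_rows = 1:floor(Int, 0.9 * nrow(ind_covs))   # stand-in for one fold's train rows

mach = machine(stdz_classifier, ind_covs, land_use)
fit!(mach, rows=train_rows)      # the Standardizer and the network see only these rows
fitted_params(mach)              # inspect the per-component learned parameters
```

During `evaluate!`, this fit-on-train-rows step is repeated for each of the ten folds.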

BTW, if you were doing regression, you could additionally standardise the target, using the TransformedTargetModel wrapper (docs).
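For reference, a minimal sketch of that wrapper (`some_regressor` here is a placeholder, not a model from this thread):

```julia
# Hypothetical regression setup: the pipeline standardises the inputs,
# and TransformedTargetModel standardises the target; both are refit per fold.
tt_model = TransformedTargetModel(some_regressor; transformer=Standardizer())
tt_pipe  = Pipeline(Standardizer(), tt_model)
```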

Thanks! Just to double-check that I understand correctly: is the procedure standardising the test dataset too? If so, there is no data leakage between the training and testing datasets, right?

Thank you very much

Pablo

Yes: for each test fold, the standardisation applied to it uses the mean and standard deviation learned from the corresponding training fold. So no data leakage.
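In plain numbers, the rule looks like this (a standalone sketch in base Julia, independent of MLJ):

```julia
# Per-fold rule: the test fold is scaled with the *train* fold's statistics.
using Statistics

x_train = [2.0, 4.0, 6.0, 8.0]       # one fold's training values of a feature
x_test  = [5.0, 10.0]                # that fold's holdout values

μ, σ = mean(x_train), std(x_train)   # learned on the train fold only
z_test = (x_test .- μ) ./ σ          # applied, unchanged, to the test fold
```

Note that the test fold's own mean and standard deviation are never computed, which is exactly why nothing leaks.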

Outstanding! Thank you very much