MLJ/MljFlux standardisation of variables in each cross-validation fold

Hi all,

I am learning Julia, and so far I have implemented a few Bayesian models using Turing and other packages. I found it quite fast and reliable.

I am now building a neural network model for a land-use/land-cover classification task, and I want to ensure that my six predictor variables are standardised within each stratified fold to avoid data leakage. In other words, I want to standardise the data independently within each fold, learning the scaling parameters from the training split and applying them to that fold's holdout split. I have used the `Pipeline` construct, but I am unsure whether it is doing what I want:

##### Independent covariates
##### Select columns with raw values of 6 covariates

ind_covs = DataFrames.select(lulc, 4, 5, 6, 7, 8, 9)

ind_covs = DataFrames.rename(ind_covs, [:ndvi, :tree_cover, :ntree_cover, :elev, :slope, :pop])

nrow(ind_covs)

print(describe(ind_covs, :all))

##### Neural network classifier
NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg=MLJFlux

##### Define the neural network classifier
lulc_classifier = NeuralNetworkClassifier(
    builder=MLJFlux.MLP(; hidden=(6, 6), σ=Flux.relu),
    epochs=100,
    loss=Flux.Losses.crossentropy)

##### Standardise (is this correct?)
##### Is this doing what I want it to do?
stdz_classifier = Pipeline(Standardizer(), lulc_classifier)

##### Neural net
nnet_lulc = machine(stdz_classifier, ind_covs, land_use)

##### Evaluate the predictive performance of the nnet
mod_perf = evaluate!(nnet_lulc, resampling = StratifiedCV(; nfolds=10), 
                               measure=[balanced_accuracy, cross_entropy])

Any tips and suggestions will be more than welcome.

Thanks!

Pablo

Yes, this does what you think it does: the standardisation and supervised learning use only the data from each training fold.
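If you want to see this explicitly, here is a rough sketch of what happens inside a single fold. The `train`/`test` row indices here are illustrative (`evaluate!` generates the stratified folds for you automatically):

```julia
using MLJ

# Illustrative stratified split; evaluate! does this per fold internally
train, test = partition(1:nrow(ind_covs), 0.9; shuffle=true, stratify=land_use)

# Fit the standardiser on the training rows ONLY
stand = machine(Standardizer(), ind_covs)
fit!(stand, rows=train)

# The means and standard deviations learned on `train` are applied to both
# splits, so no information from the holdout rows leaks into the scaling
X_train = transform(stand, ind_covs[train, :])
X_test  = transform(stand, ind_covs[test, :])
```

The `Pipeline` wrapper simply bundles this fit-on-train / transform-both logic together with the classifier, so `evaluate!` repeats it for every fold.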

BTW, if you were doing regression, you could additionally standardise the target, using the TransformedTargetModel wrapper (docs).
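For example, a minimal sketch (the regressor and the data names `X`, `y` are placeholders, not from your code):

```julia
using MLJ

RidgeRegressor = @load RidgeRegressor pkg=MLJLinearModels

# Standardise the predictors via a pipeline and the target via
# TransformedTargetModel: y is standardised before training, and
# predictions are automatically inverse-transformed back to the
# original scale
model = TransformedTargetModel(
    Pipeline(Standardizer(), RidgeRegressor());
    transformer=Standardizer())

mach = machine(model, X, y)
```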