MLJ/MljFlux standardisation of variables in each cross-validation fold

Hi all,

I am learning Julia, and so far I have implemented a few Bayesian models using Turing and other packages. I found it quite fast and reliable.

I am now building a neural network model for a land-use/land-cover task, and I want to ensure that my six predictive variables are standardised within each stratified fold to avoid data leakage. In other words, I want to standardise the data independently within each fold, learning the scaling parameters (means and standard deviations) on the training portion and applying them to that fold's holdout portion. I have used the `Pipeline` construct, but I am unsure whether it is doing what I want:

##### Independent covariates
##### Select columns with raw values of 6 covariates

ind_covs = DataFrames.select(lulc, 4, 5, 6, 7, 8, 9)

ind_covs = DataFrames.rename(ind_covs, [:ndvi, :tree_cover, :ntree_cover, :elev, :slope, :pop])

nrow(ind_covs)

print(describe(ind_covs, :all))

##### Neural network classifier
NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg=MLJFlux

##### Define the neural network classifier
lulc_classifier = NeuralNetworkClassifier(builder = MLJFlux.MLP(; hidden=(6,6), σ=Flux.relu),
            epochs=100, loss = Flux.Losses.crossentropy)

##### Standardise (is this correct?)
##### Is this doing what I want it to do?
stdz_classifier = Pipeline(Standardizer(), lulc_classifier)

##### Neural net
nnet_lulc = machine(stdz_classifier, ind_covs, land_use)

##### Evaluate the predictive performance of the nnet
mod_perf = evaluate!(nnet_lulc, resampling = StratifiedCV(; nfolds=10), 
                               measure=[balanced_accuracy, cross_entropy])

Any tips and suggestions will be more than welcome.

Thanks!

Pablo

Yes, this does what you think it does: the standardisation and supervised learning use only the data from each training fold.
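If you want to see this per-fold behaviour directly, you can mimic what `evaluate!` does for a single fold by fitting the machine on a restricted set of rows and inspecting the learned parameters. A sketch, assuming the objects from your post (`stdz_classifier`, `ind_covs`, `land_use`) are already in scope:

```julia
# Sketch: fit on "training fold" rows only, as evaluate! does internally.
train_rows = 1:floor(Int, 0.9 * nrow(ind_covs))   # stand-in for one fold's train rows

mach = machine(stdz_classifier, ind_covs, land_use)
fit!(mach, rows=train_rows)      # the Standardizer and the network see only these rows
fitted_params(mach)              # inspect the per-component learned parameters
```

During `evaluate!`, this fit-on-train-rows step is repeated for each of the ten folds.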

BTW, if you were doing regression, you could additionally standardise the target, using the TransformedTargetModel wrapper (docs).
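For reference, a minimal sketch of that wrapper (`some_regressor` here is a placeholder, not a model from this thread):

```julia
# Hypothetical regression setup: the pipeline standardises the inputs,
# and TransformedTargetModel standardises the target; both are refit per fold.
tt_model = TransformedTargetModel(some_regressor; transformer=Standardizer())
tt_pipe  = Pipeline(Standardizer(), tt_model)
```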

Thanks! Just to double-check that I understand correctly: is the procedure standardising the test dataset too? If so, there is no data leakage between the training and testing datasets, right?

Thank you very much

Pablo

Yes: for each test fold, the standardisation applied to it uses the mean and standard deviation learned from the corresponding training fold. So no data leakage.
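In plain numbers, the rule looks like this (a standalone sketch in base Julia, independent of MLJ):

```julia
# Per-fold rule: the test fold is scaled with the *train* fold's statistics.
using Statistics

x_train = [2.0, 4.0, 6.0, 8.0]       # one fold's training values of a feature
x_test  = [5.0, 10.0]                # that fold's holdout values

μ, σ = mean(x_train), std(x_train)   # learned on the train fold only
z_test = (x_test .- μ) ./ σ          # applied, unchanged, to the test fold
```

Note that the test fold's own mean and standard deviation are never computed, which is exactly why nothing leaks.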

Outstanding! Thank you very much