Hi all,
I am learning Julia, and so far I have implemented a few Bayesian models using Turing and other packages, which I found fast and reliable.
I am now building a neural network model for a land-use/land-cover (LULC) classification task, and I want to make sure that my six predictor variables are standardised within each stratified fold to avoid data leakage. In other words, each fold should be standardised independently, with the scaling parameters (means and standard deviations) learned from the training split and then applied to that fold's holdout split. I have used MLJ's Pipeline construct, but I am unsure whether it does this:
##### Independent covariates
##### Select columns with raw values of 6 covariates
ind_covs = DataFrames.select(lulc, 4, 5, 6, 7, 8, 9)
ind_covs = DataFrames.rename(ind_covs, [:ndvi, :tree_cover, :ntree_cover, :elev, :slope, :pop])
nrow(ind_covs)
print(describe(ind_covs, :all))
##### Neural network classifier
NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg = MLJFlux
##### Define the neural network classifier
lulc_classifier = NeuralNetworkClassifier(builder = MLJFlux.MLP(; hidden=(6,6), σ=Flux.relu),
epochs=100, loss = Flux.Losses.crossentropy)
##### Standardise within the pipeline (is this doing what I want?)
stdz_classifier = Pipeline(Standardizer(), lulc_classifier)
##### Neural net
nnet_lulc = machine(stdz_classifier, ind_covs, land_use)
##### Evaluate the predictive performance of the nnet
mod_perf = evaluate!(nnet_lulc, resampling = StratifiedCV(; nfolds=10),
measure=[balanced_accuracy, cross_entropy])
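To check my understanding, I also tried the Standardizer on its own with synthetic data (the column names x1 and x2 are just placeholders, not my real covariates), fitting it on training rows only and then applying it to holdout rows. This is the per-fold behaviour I hope the pipeline reproduces when evaluate! resamples it:

```julia
using MLJ, DataFrames, Random, Statistics

Random.seed!(1)
X = DataFrame(x1 = randn(100) .* 3 .+ 10,   # synthetic covariates on
              x2 = randn(100) .* 0.5)       # different scales
train, test = partition(1:100, 0.7)

mach = machine(Standardizer(), X)
fit!(mach, rows = train)            # means/stds learned from training rows only

Xtr = MLJ.transform(mach, X[train, :])   # standardised training split
Xte = MLJ.transform(mach, X[test, :])    # train-derived scaling applied to holdout

println(mean(Xtr.x1))   # ≈ 0 on the training split
println(std(Xtr.x1))    # ≈ 1 on the training split
println(mean(Xte.x1))   # generally not exactly 0: scaled with train parameters
```

If the pipeline refits the Standardizer inside each fold of StratifiedCV, I would expect the same pattern: the transformer trained on the fold's training rows and its parameters reused for the holdout rows.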
Any tips and suggestions would be more than welcome.
Thanks!
Pablo