Regression Random Forests: mean coefficient of determination in cross-validation is surprisingly good for pure-noise inputs

I fitted a regression RF model on random (low-dimensional) noise data and got surprisingly good estimates of the mean coefficient of determination from cross-validation.

Here is my code, based on the example in https://github.com/bensadeghi/DecisionTree.jl

using Random
using DecisionTree

Random.seed!(2020)
nsamples = 100
nfeatures = 6

# training features and labels
xTR = rand(nsamples,nfeatures)
yTR = rand(nsamples)

# testing features and labels
xTE = rand(nsamples,nfeatures)
yTE = rand(nsamples)

n_subfeatures=round(Int,nfeatures/2); n_trees=50; partial_sampling=0.7; max_depth=-1
min_samples_leaf=1; min_samples_split=2; min_purity_increase=0.0; seed=3

model = build_forest(yTR, xTR,
                     n_subfeatures,
                     n_trees,
                     partial_sampling,
                     max_depth,
                     min_samples_leaf,
                     min_samples_split,
                     min_purity_increase;
                     rng = seed)

n_folds=3

r2 =  nfoldCV_forest(yTR, xTR,
                     n_folds,
                     n_subfeatures,
                     n_trees,
                     partial_sampling,
                     max_depth,
                     min_samples_leaf,
                     min_samples_split,
                     min_purity_increase;
                     verbose = true,
                     rng = seed)

yTE_hat = apply_forest(model, xTE)
yTR_hat = apply_forest(model, xTR)

@info(". Coefficient of determination for training $(DecisionTree.R2(yTR, yTR_hat))")
@info(". Coefficient of determination for testing $(DecisionTree.R2(yTE, yTE_hat))")

This gives me

Fold 1
Mean Squared Error:     0.020002234552075074
Correlation Coeff:      0.9623930625920759
Coeff of Determination: 0.7602402973281025

Fold 2
Mean Squared Error:     0.020702135796228545
Correlation Coeff:      0.9464875388483548
Coeff of Determination: 0.7330648421477141

Fold 3
Mean Squared Error:     0.013866320099408993
Correlation Coeff:      0.9433743891893996
Coeff of Determination: 0.7419187614811039

Mean Coeff of Determination: 0.7450746336523069

[ Info: . Coefficient of determination for training 0.751643802875819
[ Info: . Coefficient of determination for testing -0.09547815687242323

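I would expect the mean coefficient of determination across the 3 folds to be about zero, because my data is just noise. That is roughly what I get on my unseen test data (about -0.09), but the cross-validation folds report about 0.75, almost the same as the training score.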

What am I missing in my code?
I am using DecisionTree v0.10.9.

Thanks

That looks like a bug in nfoldCV_forest. It appears to be reporting the training R2 for each fold instead of the R2 on that fold's held-out (test) rows.

You could try running the cross-validation by hand, or you could use MLJ.jl to run it.
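Here is a minimal sketch of a by-hand cross-validation, assuming the same build_forest and apply_forest calls and the same hyperparameters as in the question. The names r2_score and manual_cv_r2 are hypothetical helpers, not part of DecisionTree.jl, and the fold assignment (every n_folds-th row of a shuffled index) is just one simple choice.

```julia
using Random
using Statistics: mean
using DecisionTree

# Hypothetical helper: R2 = 1 - SS_res / SS_tot, computed on the given rows only.
r2_score(y, y_hat) = 1 - sum(abs2, y .- y_hat) / sum(abs2, y .- mean(y))

# Hypothetical helper (not part of DecisionTree.jl): k-fold cross-validation
# that fits on the training rows and scores on the held-out rows of each fold.
function manual_cv_r2(y, X, n_folds; n_subfeatures = 3, n_trees = 50,
                      partial_sampling = 0.7, seed = 3)
    n = length(y)
    idx = shuffle(MersenneTwister(seed), 1:n)   # shuffled row indices
    scores = Float64[]
    for k in 1:n_folds
        test = idx[k:n_folds:n]                 # held-out rows for fold k
        train = setdiff(idx, test)              # all remaining rows
        model = build_forest(y[train], X[train, :],
                             n_subfeatures, n_trees, partial_sampling,
                             -1, 1, 2, 0.0;     # max_depth, min_samples_leaf, min_samples_split, min_purity_increase
                             rng = seed)
        y_hat = apply_forest(model, X[test, :]) # predict only on unseen rows
        push!(scores, r2_score(y[test], y_hat))
    end
    return scores
end
```

Calling manual_cv_r2(yTR, xTR, 3) returns one R2 per fold. If the per-fold scores from this come out near or below zero on the noise data while nfoldCV_forest still reports about 0.75, that would support the idea that nfoldCV_forest is scoring on training rows.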
