I am puzzled by an issue with a boolean classification task using a random forest (RF) on a fairly high-dimensional dataset (1680 observations x 110 features) with moderate class imbalance (431 vs 1249).
The train/test partition is random (0.8 / 0.2).
When I train the RF I get an almost perfect accuracy/confusion matrix on the training set, but when I predict on the test set almost all predictions fall in the same class.
I initially thought it was just overfitting, but first, RF does not usually overfit too badly, and secondly the errors are not "random" ones, as they would be if there were simply no connection between X and Y: all the predictions end up in the same class (see the per-class check sketched after the code below).
But there isn't anything specific that differs between the train and test sets, so what the hell could it be?
Note that I have this issue using my own RF implementation (BetaML, as in the code below), but the student originally got the same problem using RF in R…
Here is the exact code…
using Pkg
Pkg.activate(@__DIR__)
using Dates, Random, Pipe, HTTP, CSV, DelimitedFiles, DataFrames, Plots, BetaML
Random.seed!(1234)
dataURL = "https://nc.beta-lorraine.fr/s/68iQB56rataBiYZ/download" # ~ 3 MB
data = @pipe HTTP.get(dataURL).body |> CSV.File(_, missingstring="NA") |> DataFrame # download and parse the CSV
(n,d) = size(data)
ycat = data.class_boolean            # target: boolean class labels
fields_toremove1 = ["class_boolean"]
data = data[:,Not(fields_toremove1)] # drop the target from the features
X = Matrix(data)                     # feature matrix
((x_train,x_test),(ycat_train,ycat_test)) = partition([X,ycat],[0.8,0.2], shuffle=true) # random 80/20 split
m = RandomForestEstimator(n_trees=30, force_classification=true, oob=true) # RF with out-of-bag error estimation
ŷ_train = fit!(m,x_train,ycat_train) # fit! returns the (probabilistic) predictions on the training set
ŷ_train = mode(ŷ_train)              # pick the most likely class for each record
ŷ_test = predict(m,x_test)           # probabilistic predictions on the test set
ŷ_test = mode(ŷ_test)
train_acc = accuracy(ycat_train, ŷ_train) # 0.998
test_accuracy_est = 1-info(m)["oob_errors"] # 0.710 (out-of-bag estimate of generalisation accuracy)
test_acc = accuracy(ycat_test, ŷ_test) # 0.767
sum(ycat_train)/length(ycat_train) # 0.2619047619047619  (share of positives in the training set)
sum(ycat_test)/length(ycat_test)   # 0.23511904761904762 (share of positives in the test set)
sum(ŷ_train)/length(ŷ_train)       # 0.2604166666666667  (share of positives in the training predictions)
sum(ŷ_test)/length(ŷ_test)         # 0.044642857142857144 (share of positives in the test predictions: almost none!)
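
To make the "all on one class" point concrete, this is the kind of quick per-class check I run on the test predictions (a minimal sketch in plain Julia reusing the variables above; it is not part of the original script):

# how the test predictions split across the two classes
npos_true = sum(ycat_test .== 1)                      # actual positives in the test set
npos_pred = sum(ŷ_test .== 1)                         # predicted positives in the test set
# true positives and recall of the positive (minority) class
tp         = sum((ycat_test .== 1) .& (ŷ_test .== 1))
recall_pos = tp / npos_true                           # close to 0 when almost everything is predicted negative

Note that the test accuracy (0.767) is barely above the share of the majority class in the test set (1 - 0.235 ≈ 0.765), which is another way of seeing that the model essentially predicts the negative class for everything.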
Very odd… I can get top scores in both training and testing if I rebalance the data with random sampling on the whole dataset (the one that I then partition into training/testing), but if I rebalance only the training set then the test-set predictions are again almost all on one category… (A sketch of what I mean by rebalancing only the training set is below.)
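
By "rebalance only the training set" I mean something along the lines of the following random oversampling of the minority class, done after the partition so that the test set keeps its original distribution (a minimal sketch in plain Julia; the *_bal names are just illustrative):

pos_idx = findall(==(1), ycat_train)                 # minority (positive) training examples
neg_idx = findall(==(0), ycat_train)                 # majority (negative) training examples
# resample the minority class with replacement until both classes have the same size
extra   = rand(pos_idx, length(neg_idx) - length(pos_idx))
keep    = shuffle(vcat(1:length(ycat_train), extra)) # shuffle comes from the `using Random` above
x_train_bal    = x_train[keep, :]
ycat_train_bal = ycat_train[keep]

Feeding x_train_bal / ycat_train_bal to fit! gives balanced classes during training while the test set stays untouched, yet the test predictions still collapse onto one class.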