I am puzzled by an issue with a boolean classification task using a random forest (RF) on a fairly high-dimensional dataset (1680 observations x 110 features) with moderate class imbalance (431 vs 1249).
The train/test partition is random (0.8 / 0.2).
When I train the RF I get an almost perfect accuracy/confusion matrix on the training set, but when I predict on the test set almost all predictions fall in the same class.
I initially thought it was just overfitting, but first, RF does not usually overfit too badly, and secondly the errors are not "random" ones, as they would be if there were simply no connection between X and Y: all the predictions end up in the same class (see the per-class check sketched after the code below).
But there isn't anything specific that differs between the train and test sets, so what the hell could it be?
Note that I have this issue using my own RF implementation (BetaML, as in the code below), but the student originally got the same problem using RF in R…
Here is the exact code…
using Pkg
Pkg.activate(@__DIR__)
using Dates, Random, Pipe, HTTP, CSV, DelimitedFiles, DataFrames, Plots, BetaML
Random.seed!(1234)
dataURL = "https://nc.beta-lorraine.fr/s/68iQB56rataBiYZ/download" # ~ 3 MB
data = @pipe HTTP.get(dataURL).body |> CSV.File(_, missingstring="NA") |> DataFrame # download and parse the CSV
(n,d) = size(data)
ycat = data.class_boolean            # target: boolean class labels
fields_toremove1 = ["class_boolean"]
data = data[:,Not(fields_toremove1)] # drop the target from the features
X = Matrix(data)                     # feature matrix
((x_train,x_test),(ycat_train,ycat_test)) = partition([X,ycat],[0.8,0.2], shuffle=true) # random 80/20 split
m = RandomForestEstimator(n_trees=30, force_classification=true, oob=true) # RF with out-of-bag error estimation
ŷ_train = fit!(m,x_train,ycat_train) # fit! returns the (probabilistic) predictions on the training set
ŷ_train = mode(ŷ_train)              # pick the most likely class for each record
ŷ_test = predict(m,x_test)           # probabilistic predictions on the test set
ŷ_test = mode(ŷ_test)
train_acc = accuracy(ycat_train, ŷ_train) # 0.998
test_accuracy_est = 1-info(m)["oob_errors"] # 0.710 (out-of-bag estimate of generalisation accuracy)
test_acc = accuracy(ycat_test, ŷ_test) # 0.767
sum(ycat_train)/length(ycat_train) # 0.2619047619047619  (share of positives in the training set)
sum(ycat_test)/length(ycat_test)   # 0.23511904761904762 (share of positives in the test set)
sum(ŷ_train)/length(ŷ_train)       # 0.2604166666666667  (share of positives in the training predictions)
sum(ŷ_test)/length(ŷ_test)         # 0.044642857142857144 (share of positives in the test predictions: almost none!)
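
To make the "all on one class" point concrete, this is the kind of quick per-class check I run on the test predictions (a minimal sketch in plain Julia reusing the variables above; it is not part of the original script):

# how the test predictions split across the two classes
npos_true = sum(ycat_test .== 1)                      # actual positives in the test set
npos_pred = sum(ŷ_test .== 1)                         # predicted positives in the test set
# true positives and recall of the positive (minority) class
tp         = sum((ycat_test .== 1) .& (ŷ_test .== 1))
recall_pos = tp / npos_true                           # close to 0 when almost everything is predicted negative

Note that the test accuracy (0.767) is barely above the share of the majority class in the test set (1 - 0.235 ≈ 0.765), which is another way of seeing that the model essentially predicts the negative class for everything.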
Very odd… I can get top scores in both training and testing if I rebalance the data with random sampling on the whole dataset (the one that I then partition into training/testing), but if I rebalance only the training set then the test-set predictions are again almost all on one category… (A sketch of what I mean by rebalancing only the training set is below.)
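
By "rebalance only the training set" I mean something along the lines of the following random oversampling of the minority class, done after the partition so that the test set keeps its original distribution (a minimal sketch in plain Julia; the *_bal names are just illustrative):

pos_idx = findall(==(1), ycat_train)                 # minority (positive) training examples
neg_idx = findall(==(0), ycat_train)                 # majority (negative) training examples
# resample the minority class with replacement until both classes have the same size
extra   = rand(pos_idx, length(neg_idx) - length(pos_idx))
keep    = shuffle(vcat(1:length(ycat_train), extra)) # shuffle comes from the `using Random` above
x_train_bal    = x_train[keep, :]
ycat_train_bal = ycat_train[keep]

Feeding x_train_bal / ycat_train_bal to fit! gives balanced classes during training while the test set stays untouched, yet the test predictions still collapse onto one class.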