Failed prediction from BetaML DecisionTreeClassifier

I am trying to do inference on the Adult dataset, which has a binary response variable (income below or above $50k/year). Seemingly I am able to train a Decision Tree using the dataset (as loaded using Fairness.jl), but upon predicting on unseen data I get nonsensical results (every instance is assigned zero probability for both classes). I am not sure what could be causing this, so any help is much appreciated! MWE below:

using Fairness, DataFrames, MLJ

# Load train and test data
features, labels = @load_adult
train, test = partition(eachindex(labels), 0.7)

# Load model
Tree = @load DecisionTreeClassifier pkg=BetaML
tree = Tree()
model = machine(tree, features, labels)

# Train model
fit!(model, rows = train)

# Test predictions
predictions = predict(model, features[test, :])

Additionally, if I try to force a point estimate with predict_mode instead of predict, all predictions default to the negative class, which is obviously not desired.
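For reference, this is how I have been inspecting the (broken) probabilistic output; a minimal sketch, assuming the machine `model` from the MWE above has been fitted and that the positive class label is ">50K" (I have not double-checked the exact level names Fairness.jl uses):

```julia
# Sketch: inspecting MLJ's probabilistic predictions.
# Assumes `model`, `features`, and `test` from the MWE above.
using MLJ  # brings `predict`, `predict_mode`, and `pdf` into scope

ŷ = predict(model, features[test, :])     # vector of UnivariateFinite distributions
probs = pdf.(ŷ, ">50K")                    # per-instance probability of the positive class
point = predict_mode(model, features[test, :])  # hard predictions (most probable class)
```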

MLJ provides two other DecisionTree model types, which you can load with

Tree1 = @load DecisionTreeClassifier pkg=DecisionTree
Tree2 = @load DecisionTreeClassifier pkg=ScikitLearn

You can use them assuming your features are all Continuous (OrderedFactor also okay for the first one). I don’t think missings are supported.

Do they give very different results? (Take some care, as the default hyperparameters may be different).
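In case it helps, you can check whether your columns satisfy those scitype requirements with `schema`, and coerce the ones that don't. A sketch, where the column names are just illustrative guesses at the Adult dataset and not something I have verified:

```julia
# Sketch: checking and coercing scitypes before using the DecisionTree.jl /
# ScikitLearn.jl models, which expect Continuous (or OrderedFactor) features.
using MLJ

schema(features)  # prints each column's machine type and scientific type

# Hypothetical coercions -- substitute your actual column names:
# features = coerce(features, :age => Continuous, :education => OrderedFactor)
```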

Thanks! I had steered away from the DecisionTree package because of incompatible scitypes in the data as loaded, but one-hot encoding the unordered multi-class features with ContinuousEncoder() and defining my model with

@pipeline ContinuousEncoder() Tree()

after loading Tree from DecisionTree seems to have done the trick.

Okay, that’s good to know. Can you please confirm that substituting the BetaML model in your pipeline is still giving suspicious output?

@sylvaticus

By the way, @liamfdoherty, @pipeline is now deprecated. You should just do pipe = ContinuousEncoder |> Tree() moving forward.
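For completeness, a sketch of the full workflow with the non-deprecated syntax, assuming `Tree`, `features`, `labels`, `train`, and `test` are defined as in the earlier posts (I use the instance form `ContinuousEncoder()` here, though recent MLJBase also accepts the bare type):

```julia
# Sketch: the |> pipeline syntax end to end.
using MLJ

pipe = ContinuousEncoder() |> Tree()   # encoder output feeds the classifier
mach = machine(pipe, features, labels)
fit!(mach, rows = train)
ŷ = predict_mode(mach, features[test, :])
```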

Substituting the BetaML model in the (working with DecisionTree) pipeline still only predicts the negative class.

When I do pipe = ContinuousEncoder |> Tree() I get the following error:

ERROR: MethodError: no method matching (::MLJDecisionTreeInterface.DecisionTreeClassifier)(::ContinuousEncoder)
Closest candidates are:
  (::Supervised)(::Tuple{AbstractMatrix, Any}) at ~/.julia/packages/MLJBase/QXObv/src/composition/learning_networks/arrows.jl:25
  (::Supervised)(::Tuple{AbstractNode, AbstractNode}) at ~/.julia/packages/MLJBase/QXObv/src/composition/learning_networks/arrows.jl:21
  (::Supervised)(::Tuple{AbstractNode, Any}) at ~/.julia/packages/MLJBase/QXObv/src/composition/learning_networks/arrows.jl:22
  ...
Stacktrace:
 [1] |>(x::ContinuousEncoder, f::MLJDecisionTreeInterface.DecisionTreeClassifier)
   @ Base ./operators.jl:911
 [2] top-level scope
   @ ~/Documents/PhD Year 3/Fall Quarter/Research/AdultAnalysis.jl:12

It’s also worth mentioning that while playing around with this I found the Pipeline docs page, but I was not able to use it: I kept getting an error that Pipeline was not defined. I made sure MLJBase.jl was installed and tried calling MLJBase.Pipeline() directly, but still got the same error… So far @pipeline has been the only thing I’ve managed to get working!

I suspect this is a package management issue. You need a reasonably recent version of MLJBase to get the new Pipeline syntax.

If you can report the output of using Pkg; Pkg.status() I may be able to guess at the pkg issue.

Alternatively, you can try installing MLJ and BetaML in a fresh environment with using Pkg; Pkg.activate("my_mlj_environment", shared=true); Pkg.add("MLJ"); Pkg.add("BetaML"), or something like that.

Hello, maintainer of BetaML here.
I confirm that, in a temporarily activated project, the OP's code results in all-negative predictions.
The low-level BetaML API, however, already works:

using HTTP, Pipe, DataFrames, CSV, BetaML

dataURL1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
dataURL2 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
data1    = @pipe HTTP.get(dataURL1).body |>  CSV.File(_,missingstring="?",delim=", ",header=false) |> DataFrame
data2    = @pipe HTTP.get(dataURL2).body |>  CSV.File(_,missingstring="?",delim=", ",skipto=2,header=false) |> DataFrame
data     = vcat(data1,data2)

X  = Matrix(data[:,1:14])
y  = data[:,15]
y  = replace.(y,"."=>"")

((xtrain,xtest),(ytrain,ytest)) = partition([X,y],[0.7,0.3]) # shuffle by default
tree  = buildTree(xtrain,ytrain)
ŷtest = predict(tree,xtest)
accuracy(ytest,mode(ŷtest)) # 0.81
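One note on interpreting that 0.81: the Adult dataset is imbalanced (roughly three quarters of instances are "<=50K"), so a degenerate all-negative classifier already scores deceptively well. A tiny self-contained sketch of the majority-class baseline, using a synthetic stand-in label vector rather than the real data:

```julia
# Sketch: majority-class baseline accuracy on an imbalanced label vector.
# `ytest_demo` is a synthetic stand-in with a 3:1 class ratio, roughly mimicking Adult.
ytest_demo = ["<=50K", "<=50K", "<=50K", ">50K"]
baseline = count(==("<=50K"), ytest_demo) / length(ytest_demo)  # 0.75
```

Comparing a model's accuracy against this baseline is a quick way to spot the "always predicts the negative class" failure mode described above.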

I am going to see what’s happening when I interface these functions, and I will keep you updated…

This is my package status output:

(@v1.8) pkg> status
Status `~/.julia/environments/v1.8/Project.toml`
  [f448d96e] BasicMD v0.3.0
  [6e4b80f9] BenchmarkTools v1.3.1
⌃ [024491cd] BetaML v0.7.0
⌅ [336ed68f] CSV v0.8.5
⌃ [a93c6f00] DataFrames v1.3.5
  [864edb3b] DataStructures v0.18.13
⌅ [7806a523] DecisionTree v0.10.13
⌃ [0c46a032] DifferentialEquations v7.3.0
⌃ [31c24e10] Distributions v0.25.71
  [7c232609] Fairness v0.3.2
  [587475ba] Flux v0.13.6
⌃ [713c75ef] Franklin v0.10.77
  [7073ff75] IJulia v1.23.3
  [de52edbc] Integrals v3.1.2
  [b964fa9f] LaTeXStrings v1.3.0
⌃ [add582a8] MLJ v0.16.7
⌅ [a7f614a8] MLJBase v0.18.26
⌃ [c6f25543] MLJDecisionTreeInterface v0.1.3
  [872c559c] NNlib v0.8.9
  [636a865e] NearestNeighborModels v0.2.0
⌃ [7f7a1694] Optimization v3.8.2
  [91a5bcdd] Plots v1.33.0
  [d330b81b] PyPlot v2.11.0
  [1fd47b50] QuadGK v2.5.0
  [2913bbd2] StatsBase v0.33.21
Info Packages marked with ⌃ and ⌅ have new versions available, but those with ⌅ cannot be upgraded. To see why use `status --outdated`

I notice that MLJ is out of date, so I tried to update it and no changes were made. I tried forcing an update to the most recent stable release with ] add MLJ@0.18.5 and get the following error:

(@v1.8) pkg> add MLJ@0.18.5
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package MLJModels [d491faf4]:
 MLJModels [d491faf4] log:
 ├─possible versions are: 0.1.0-0.15.12 or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions 0.1.0-0.15.12
 ├─restricted by compatibility requirements with MLJ [add582a8] to versions: 0.15.6-0.15.12
 │ └─MLJ [add582a8] log:
 │   ├─possible versions are: 0.1.0-0.18.5 or uninstalled
 │   └─restricted to versions 0.18.5 by an explicit requirement, leaving only versions 0.18.5
 └─restricted by compatibility requirements with Fairness [7c232609] to versions: 0.11.0-0.14.14 — no versions left
   └─Fairness [7c232609] log:
     ├─possible versions are: 0.1.0-0.3.2 or uninstalled
     └─restricted to versions * by an explicit requirement, leaving only versions 0.1.0-0.3.2

It appears to me that MLJModels is the problem, but I’m not sure how to fix this. Any suggestions?

EDIT: upon uninstalling and reinstalling MLJ, it was revealed that Fairness.jl was the issue. Removing both MLJ and Fairness, then reinstalling MLJ with ] add MLJ@0.18.5 got me to version 0.18.5, as desired. And now, pipe = ContinuousEncoder |> Tree() works. I’ll just load my data as @sylvaticus did. Thanks!

On my side I found the issue. As I suspected, it was in the interface code. I’ll get a working version of BetaML out tomorrow (but not a full release, as I am working on something else).

Please note that the original dataset is a bit messy and, as the script above shows, it needs a bit of clean-up…

Thank you! Yes, I am aware the data is a bit messy; that’s part of what makes it interesting for me! I already have a script running that cleans the data, and a predictive model working.

Thanks again for all your help!

@liamfdoherty Yes, I see that Fairness.jl is way behind on maintenance, with several old CompatHelper PRs not merged.

I have raised an issue there: Maintenance · Issue #78 · ashryaagr/Fairness.jl · GitHub
