Problem using MLJ and Multiclass scitypes

Hello everyone,

I am teaching a course using Scikit-Learn, and I wanted to translate my materials to MLJ (for a possible future :slight_smile: ), but I am having many problems. I am using MLJ#dev.

First, I have a dataframe with only 4 attributes, all of them Multiclass or OrderedFactor. I am trying to apply a DecisionTreeClassifier, which should be able to use them, as described in
https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#A-preview-of-data-type-specification-in-MLJ-1

And the types are right, I think:

println(ScientificTypes.scitype(input_test))

gives me:

Table{Union{AbstractArray{Multiclass{6},1}, AbstractArray{Multiclass{5},1}, AbstractArray{OrderedFactor{4},1}}}

and target_test is from type:

AbstractArray{Multiclass{2},1}

So, I suppose I can apply it.

However, when I do

march = machine(tree, input_test, target_test)

It gives me:

┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=DecisionTreeClassifier @140`:
│ scitype(X) = Table{Union{AbstractArray{Multiclass{6},1}, AbstractArray{Multiclass{5},1}, AbstractArray{OrderedFactor{4},1}}}
│ input_scitype(model) = Table{var"#s45"} where var"#s45"<:Union{AbstractArray{var"#s13",1} where var"#s13"<:Continuous, AbstractArray{var"#s13",1} where var"#s13"<:Count, AbstractArray{var"#s13",1} where var"#s13"<:OrderedFactor}.

Which could be the reason?

The evaluate! call is working with cross_entropy. However, given the previous warning message, I am not sure it is working correctly.

So it’s telling you that the expected input scitype for a DecisionTreeClassifier is Count, Continuous, or OrderedFactor, and that you’re feeding it something else (Multiclass), which is not ordered.

So what you should do is coerce the Multiclass columns so that they have an ordering. Or you could encode the Multiclass features first and then apply the decision tree.
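The coercion route would look something like this (untested sketch; the column names here are hypothetical, adapt them to your actual dataframe):

```julia
using MLJ  # re-exports coerce and the scientific types

# Coerce the Multiclass columns to OrderedFactor so the table's
# scitype matches what DecisionTreeClassifier accepts:
input_test = coerce(input_test,
                    :colA => OrderedFactor,
                    :colB => OrderedFactor)
```

If the default (lexicographic) ordering is not the one you want, you can fix it explicitly on the underlying categorical vector, e.g. with `levels!(input_test.colA, ["low", "medium", "high"])`.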

What it’s doing here is effectively treating the Multiclass features as if they were ordered, using the ordering stored in the categorical array (lexicographic, if I’m not mistaken); this may be fine, but maybe not (which is why there’s a warning).

Note, possibly of interest: https://github.com/bensadeghi/DecisionTree.jl/issues/92

Thank you, @tlienart, I did not know how to interpret the warning; I didn’t realise that it was telling me the scitypes allowed by the model. So, I guess, I can coerce all fields to OrderedFactor, or use a ContinuousEncoder.
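For the ContinuousEncoder route, the idea would be to chain the encoder and the tree in a pipeline, roughly like this (untested sketch; the `|>` pipeline syntax depends on your MLJ version, older versions used the @pipeline macro instead):

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=DecisionTree

# One-hot/continuous-encode the categorical features first,
# then feed the result to the tree:
pipe = ContinuousEncoder() |> Tree()

mach = machine(pipe, input_test, target_test)
fit!(mach)
```

The composite machine trains the encoder and the tree together, so new data passed to predict goes through the same encoding.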

Another question: when I use a model like that, evaluate! expects a measure like cross_entropy; is there a way to use another one (like accuracy)? The warning suggests setting check_measure to false, but when I do that, it gives me an error message.

Thanks in advance.

I’m not familiar with check_measure (the docs should have something on it, though; otherwise you might want to open an issue about it, but typically Anthony writes pretty exhaustive docs).

That said, I think it’s important to get the philosophy of MLJ: models have an input type they allow (as you saw) and also an “output type”, i.e. they can be Deterministic or Probabilistic. The former returns point-wise estimates, whereas the latter returns scores (“probabilities”). You can get this using the info function; if you’re curious about the code, you can also see it in the interface: https://github.com/alan-turing-institute/MLJModels.jl/blob/52a6e64807577e256b15b7be5f26383fa5ca3f2a/src/DecisionTree.jl#L80

What that means is that, when you call predict on a machine with a decision tree model, you get a score out of it, not just a single class. To compute a performance metric such as accuracy, which requires point-wise estimates, you need to pass that score through something, e.g. typically taking the class with the highest score.

So, in short, if you have a 3-class classification problem, the output of a prediction of the machine with a DecisionTreeClassifier might be [0.3, 0.2, 0.5], and to compute the accuracy you then have to do one extra step: take the class corresponding to the highest score (0.5). You can do this automatically using the mode function, or do both together using predict_mode (all this is in the docs, btw; there’s also a tutorial on it).
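In code, the two routes look roughly like this (sketch, assuming mach is an already-fitted machine wrapping a probabilistic classifier):

```julia
# Probabilistic predictions: a vector of distributions over classes
yhat = predict(mach, input_test)

# Point predictions, step by step: take the most probable class
mode.(yhat)

# ...or do both steps in one call:
predict_mode(mach, input_test)
```

If you want the probability of one particular class rather than the mode, you can broadcast pdf over the predictions, e.g. `pdf.(yhat, some_class)`.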

Cross-entropy expects a score, which explains why it worked fine out of the box.
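Putting this together for evaluate! (sketch; if I remember correctly there is an operation keyword to tell it which prediction function to use, but check the docs for your version):

```julia
# cross_entropy consumes the probabilistic predictions directly:
evaluate!(mach, measure=cross_entropy)

# accuracy needs point predictions, so have evaluate! use
# predict_mode instead of predict:
evaluate!(mach, measure=accuracy, operation=predict_mode)
```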


Thank you for the complete explanation, I have already applied the suggestions to obtain the measure.

It is true that both frameworks (Scikit-learn and MLJ) have different philosophies, and sometimes a simple translation of the solution to a problem is not intuitive. However, after reading the excellent documentation carefully, it is simple to do it in a more MLJish way :-). When I have finished the examples, both in Python with scikit-learn and in Julia with MLJ, I will post them.
