Problem using MLJ and Multiclass scitypes

Hello everyone,

I am teaching a course using Scikit-Learn, and I wanted to translate my materials to MLJ (for a possible future :slight_smile: ), but I am having many problems. I am using MLJ#dev.

First, I have a dataframe with only 4 attributes, all of them Multiclass or OrderedFactor. I am trying to apply a DecisionTreeClassifier, which should be able to use them, as described in
https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#A-preview-of-data-type-specification-in-MLJ-1

And the types are right, I think:

println(ScientificTypes.scitype(input_test))

gives me:

Table{Union{AbstractArray{Multiclass{6},1}, AbstractArray{Multiclass{5},1}, AbstractArray{OrderedFactor{4},1}}}

and target_test is from type:

AbstractArray{Multiclass{2},1}

So, I suppose I can apply it.

However, when I do

march = machine(tree, input_test, target_test)

It gives me:

┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=DecisionTreeClassifier @140`:
│ scitype(X) = Table{Union{AbstractArray{Multiclass{6},1}, AbstractArray{Multiclass{5},1}, AbstractArray{OrderedFactor{4},1}}}
│ input_scitype(model) = Table{var"#s45"} where var"#s45"<:Union{AbstractArray{var"#s13",1} where var"#s13"<:Continuous, AbstractArray{var"#s13",1} where var"#s13"<:Count, AbstractArray{var"#s13",1} where var"#s13"<:OrderedFactor}.

Which could be the reason?

The evaluate! call is working with cross_entropy. However, given the previous warning message, I am not sure it is working correctly.

So it’s telling you that the expected input scitype for a DecisionTreeClassifier is Count, Continuous, or OrderedFactor, and that you’re feeding it something else (Multiclass), which is not ordered.

So what you should do is coerce the Multiclass columns so that they have an ordering. Or you could encode the Multiclass features first and then apply the decision tree.
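The coercion route would look something like this (untested sketch; the column names here are hypothetical, adapt them to your actual dataframe):

```julia
using MLJ  # re-exports coerce and the scientific types

# Coerce the Multiclass columns to OrderedFactor so the table's
# scitype matches what DecisionTreeClassifier accepts:
input_test = coerce(input_test,
                    :colA => OrderedFactor,
                    :colB => OrderedFactor)
```

If the default (lexicographic) ordering is not the one you want, you can fix it explicitly on the underlying categorical vector, e.g. with `levels!(input_test.colA, ["low", "medium", "high"])`.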

What it’s doing here is effectively treating the Multiclass features as if they were ordered, using the ordering stored in the categorical array (lexicographic, if I’m not mistaken); this may be fine, but maybe not (which is why there’s a warning).

Note, possibly of interest: https://github.com/bensadeghi/DecisionTree.jl/issues/92

Thank you, @tlienart, I did not know how to interpret the warning; I didn’t realise that it was telling me the scitypes allowed by the model. So, I guess, I can coerce all fields to OrderedFactor, or use a ContinuousEncoder.
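For the ContinuousEncoder route, the idea would be to chain the encoder and the tree in a pipeline, roughly like this (untested sketch; the `|>` pipeline syntax depends on your MLJ version, older versions used the @pipeline macro instead):

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=DecisionTree

# One-hot/continuous-encode the categorical features first,
# then feed the result to the tree:
pipe = ContinuousEncoder() |> Tree()

mach = machine(pipe, input_test, target_test)
fit!(mach)
```

The composite machine trains the encoder and the tree together, so new data passed to predict goes through the same encoding.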

Another question: when I use a model like that, evaluate! expects a measure like cross_entropy; is there a way to use another one (like accuracy)? The warning suggests setting check_measure to false, but when I do that, it gives me an error message.

Thanks in advance.

I’m not familiar with check_measure (the docs should have something on it, though; otherwise you might want to open an issue about it, but typically Anthony writes pretty exhaustive docs).

That said, I think it’s important to get the philosophy of MLJ: models have an input type they allow (as you saw) and also an “output type”, i.e. they can be Deterministic or Probabilistic. The former returns point-wise estimates, whereas the latter returns scores (“probabilities”). You can get this using the info function; if you’re curious about the code, you can also see it in the interface: https://github.com/alan-turing-institute/MLJModels.jl/blob/52a6e64807577e256b15b7be5f26383fa5ca3f2a/src/DecisionTree.jl#L80

What that means is that, when you call predict on a machine with a decision tree model, you get a score out of it, not just a single class. To compute a performance metric such as accuracy, which requires point-wise estimates, you need to pass that score through something, e.g. typically taking the class with the highest score.

So, in short, if you have a 3-class classification problem, the output of a prediction of the machine with a DecisionTreeClassifier might be [0.3, 0.2, 0.5], and to compute the accuracy you then have to do one extra step: take the class corresponding to the highest score (0.5). You can do this automatically using the mode function, or do both together using predict_mode (all this is in the docs, btw; there’s also a tutorial on it).
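In code, the two routes look roughly like this (sketch, assuming mach is an already-fitted machine wrapping a probabilistic classifier):

```julia
# Probabilistic predictions: a vector of distributions over classes
yhat = predict(mach, input_test)

# Point predictions, step by step: take the most probable class
mode.(yhat)

# ...or do both steps in one call:
predict_mode(mach, input_test)
```

If you want the probability of one particular class rather than the mode, you can broadcast pdf over the predictions, e.g. `pdf.(yhat, some_class)`.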

Cross-entropy expects a score, which explains why it worked fine out of the box.
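Putting this together for evaluate! (sketch; if I remember correctly there is an operation keyword to tell it which prediction function to use, but check the docs for your version):

```julia
# cross_entropy consumes the probabilistic predictions directly:
evaluate!(mach, measure=cross_entropy)

# accuracy needs point predictions, so have evaluate! use
# predict_mode instead of predict:
evaluate!(mach, measure=accuracy, operation=predict_mode)
```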


Thank you for the complete explanation, I have already applied the suggestions to obtain the measure.

It is true that both frameworks (Scikit-learn and MLJ) have different philosophies, and sometimes a simple translation of the solution to a problem is not intuitive. However, after reading the excellent documentation carefully, it is simple to do it in a more MLJish way :-). When I have finished the examples, both in Python with scikit-learn and in Julia with MLJ, I will post them.
