How to create an MLJModelInterface.Model interface for a complex model?

Hi there,
I am trying to build an MLJ interface for some ML algorithms in the BetaML package.

I am starting with the decision trees, but I have a few questions.

The function creating (and fitting) the tree is:

buildTree(x, y::Array{Ty,1}; maxDepth = size(x,1), minGain=0.0, minRecords=2, maxFeatures=size(x,2), forceClassification=false, splittingCriterion = (Ty <: Number && !forceClassification) ? variance : gini, mCols=nothing) where {Ty}
  1. As you can see, the defaults of some parameters depend on the data; for example, maxFeatures depends on the dimensionality of the explanatory variables. I understand that model parameters should be part of the model struct, but how do I set defaults without seeing the data?
  2. Even harder, the algorithm that I am trying to wrap automatically performs either a regression or a classification task (and, in the latter case, it returns a probability distribution) depending on the type of the label, with the option to override the task with forceClassification. Since MLJ distinguishes between different types of models, probabilistic and deterministic, which one do I choose? Or should I wrap it as two separate MLJ models?
  3. Most of my models support Missing data in the input. I read that Missing is a scientific type in its own right. Should I then declare a Union of the supported types, including Missing?
  4. I have a case where my model doesn't fit the fit/predict workflow: a model that (using GMM/EM) predicts the missing values in a matrix, based on how similar the remaining elements of a row are to the other rows. How do I wrap it with MLJ?
  5. Where can I find real-world examples? For example, DecisionTree.jl seems to be available through MLJ, but there is no code concerning MLJ in its GitHub repo…
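
To make question 1 concrete, here is a minimal sketch of one possible pattern: store a sentinel value in the struct and replace it with the data-dependent default once the data is available at fit time. The struct TreeHyperparams, the function resolve_defaults and the choice of 0 as the sentinel are all hypothetical names and conventions used only for illustration; they are not part of BetaML or MLJ.

```julia
# Hypothetical hyperparameter container; field names mirror the buildTree keywords.
Base.@kwdef mutable struct TreeHyperparams
    maxDepth::Int = 0        # 0 is a sentinel meaning "use size(x, 1)"
    minGain::Float64 = 0.0
    minRecords::Int = 2
    maxFeatures::Int = 0     # 0 is a sentinel meaning "use size(x, 2)"
end

# Replace the sentinels with data-dependent values once x is available.
function resolve_defaults(hp::TreeHyperparams, x::AbstractMatrix)
    maxDepth    = hp.maxDepth == 0 ? size(x, 1) : hp.maxDepth
    maxFeatures = hp.maxFeatures == 0 ? size(x, 2) : hp.maxFeatures
    return (maxDepth = maxDepth, maxFeatures = maxFeatures,
            minGain = hp.minGain, minRecords = hp.minRecords)
end

x = rand(10, 3)
resolved = resolve_defaults(TreeHyperparams(), x)                # all defaults
custom   = resolve_defaults(TreeHyperparams(maxFeatures = 2), x) # user override
```

With this approach the struct can be constructed without any data, and a fit method would call something like resolve_defaults before passing the keywords on to buildTree.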

Thank you!

While I somehow managed to write the MLJ interface for a deterministic model, I am now trying to write the interface for a probabilistic model whose predict(model,X) method returns a vector of dictionaries of label => prob.

I normally use arrays of T for Y, but I saw that it also works when Y is a CategoricalArray.

However, I am stuck here now, and don't know how to return the prediction in the format wanted by MLJ.
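
As a possible first step, the vector of dictionaries can be flattened into a class list plus a matrix of probabilities with one row per record. The sketch below uses plain Julia only; the sample data in preds is made up for illustration. As far as I understand, a class list and a probability matrix of this shape are the kind of input that MLJModelInterface.UnivariateFinite accepts, but the exact signature (and how the classes must relate to the categorical pool of Y) should be checked against the MLJ documentation.

```julia
# Hypothetical predictions in the format described above: one Dict per record.
preds = [Dict("a" => 0.7, "b" => 0.3),
         Dict("b" => 0.4, "c" => 0.6)]

# Fix a single class ordering shared by every row.
classes = sort(unique(reduce(vcat, [collect(keys(d)) for d in preds])))

# probs[i, j] is the probability of classes[j] for record i;
# labels absent from a record's Dict get probability 0.0.
probs = [get(d, c, 0.0) for d in preds, c in classes]
```

Each row of probs then sums to one as long as each dictionary does, and the column order is given by classes.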

[EDIT]: I moved this post under this thread to consolidate it…