Hi everyone, I have mixed type data and want to apply different functions to different types in a pipeline.
Like imputation with the mean for continuous columns and KNN imputation for categorical ones, and so on…
Is it possible with MLJ or some other package?
We are already using ScientificTypes.jl to decide whether or not it makes sense to apply certain transforms to continuous vs. categorical columns. We could easily provide a generic transform that knows which columns to select, but this is use-case-specific. Feel free to reach out to us in our Zulip machine-learning stream for further questions.
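For illustration, this is roughly the kind of check being described, using ScientificTypes.jl (the table and column names are just made up):

```julia
using ScientificTypes

# toy table; column names are made up for illustration
table = (age = [25.0, 30.0, 41.0], color = ["red", "blue", "red"])
table = coerce(table, :color => Multiclass)

schema(table)            # reports the scitype of each column
elscitype(table.age)     # Continuous => mean imputation makes sense
elscitype(table.color)   # Multiclass => it does not
```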
Thanks for the reply, but I was aiming for a method that I could include inside a pipeline in order to create a more straightforward and condensed way of treating data.
More like this: https://alan-turing-institute.github.io/MLJ.jl/dev/linear_pipelines/
and the ColumnTransformer from scikit-learn in Python.
You can select/reject columns with Select and Reject and then apply specific transforms to each branch. Afterwards you can join the results with a Parallel transform. The documentation explains this better.
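Roughly (from memory of the TableTransforms.jl API, so double-check the names against the docs), a split/apply/combine pipeline looks like this, where → chains transforms and ⊔ runs branches in parallel and merges the resulting columns:

```julia
using TableTransforms

# toy table; :a and :b are continuous, :c is categorical (names made up)
table = (a = rand(100), b = 10 .* rand(100), c = rand(["x", "y"], 100))

# one branch z-scores the continuous columns, the other just passes :c through;
# ⊔ (\sqcup) merges the columns produced by both branches
pipeline = (Select(:a, :b) → ZScore()) ⊔ Select(:c)

newtable, cache = apply(pipeline, table)
```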
You can do some of what you want in MLJ. The built-in transformers are documented here. The FillImputer handles mixed types but is pretty basic. There is also MissingImputator (an MLJ model with core algorithm provided by BetaML.jl) for continuous data, which uses EM clustering. Probably not good for larger datasets. You might need to look at BetaML.jl to get detailed docs.
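For the basic mixed-type case, a minimal FillImputer sketch (the defaults are, if I recall correctly, median for continuous and mode for finite/categorical columns; both can be changed via the model's hyper-parameters, and the column names below are made up):

```julia
using MLJ

# toy table with missings in a Continuous and a Multiclass column
X = (age   = [25.0, missing, 41.0, 37.0],
     color = coerce(["red", "blue", missing, "red"], Union{Missing,Multiclass}))

imputer = FillImputer()
mach = machine(imputer, X)
fit!(mach)
transform(mach, X)       # missings replaced column by column
```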
Unlike in TableTransforms.jl, feature selection is specified as a transformer hyper-parameter. Currently MLJ has not adopted the split/apply/combine paradigm of TableTransforms.jl (although I think that's probably a good idea).
From what I understood from the docs and your reply, MLJ has built-in transformers that handle mixed data types, but there is no way of applying certain functions to a certain data (or scientific) type through the pipeline (like TableTransforms.jl)?
Well, rather than split/apply/combine, you send your table to a model, and that model selectively operates on certain columns - for example, OneHotEncoder just spawns new columns for the Multiclass features - and the other columns are left untouched. The next model might selectively standardise all Continuous features, and so on. So, in principle, you should be able to carry out the same kinds of processing. But you are forced to compute in sequence, so performance may not be as good. I think the TableTransforms.jl approach is better.
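To make that concrete, here is a rough sketch of such a sequential (linear) pipeline; the toy column names are made up, and note that in this order the dummy columns produced by OneHotEncoder get standardised too:

```julia
using MLJ

# toy mixed-type table (column names made up)
X = (height = [1.85, 1.67, 1.72, 1.90],
     gender = coerce(["m", "f", "f", "m"], Multiclass))

# OneHotEncoder only touches Multiclass columns; Standardizer then only
# touches Continuous columns (which now include the new dummy columns)
pipe = OneHotEncoder() |> Standardizer()

mach = machine(pipe, X)
fit!(mach)
transform(mach, X)
```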
That said, if you build a composite model using MLJ's learning network syntax (instead of the "canned" linear pipeline syntax) then you have more flexibility. You can do split/apply/combine and a lot more. (For example, MLJ's model Stack functionality is implemented using learning networks. And there is a PR under review to make learning networks multithreaded.) But for routine pre-processing, this might be overkill.
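For example, here is a rough, untested split/apply/combine sketch using learning-network nodes; the column names and the merge step are purely illustrative:

```julia
using MLJ
import Tables

# toy mixed-type table (column names made up)
X = (age   = [25.0, 30.0, 41.0, 37.0],
     color = coerce(["red", "blue", "red", "green"], Multiclass))

Xs = source(X)

# "split": lift ordinary column selection into the network with `node`
Xcont = node(x -> MLJ.selectcols(x, [:age]), Xs)
Xcat  = node(x -> MLJ.selectcols(x, [:color]), Xs)

# "apply": a different transformer on each branch
Wcont = transform(machine(Standardizer(), Xcont), Xcont)
Wcat  = transform(machine(OneHotEncoder(), Xcat), Xcat)

# "combine": merge the processed branches back into a single column table
W = node((a, b) -> merge(Tables.columntable(a), Tables.columntable(b)), Wcont, Wcat)

fit!(W)   # trains every machine in the network
W()       # the fully processed table
```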