Pipeline for different columns

Hi everyone, I have mixed-type data and want to apply different functions to different types in a pipeline, like mean imputation for continuous columns and kNN imputation for categorical ones, and so on…
Is it possible with MLJ or some other package?

Thanks

Are you aware of TableTransforms.jl?

We are already using ScientificTypes.jl to decide whether or not it makes sense to apply certain transforms to continuous vs. categorical columns. We could easily provide a generic transform that knows which columns to select, but this is use-case-specific. Feel free to reach out to us in our Zulip machine-learning stream for further questions.
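For a concrete sense of what ScientificTypes.jl sees, here is a small sketch (the toy table and column names are made up for illustration):

julia> using DataFrames, ScientificTypes

julia> df = DataFrame(a = [1.5, 2.5], b = ["x", "y"]);

julia> schema(df)   # reports each column's machine type and scientific type

julia> df = coerce(df, :b => Multiclass);   # declare :b to be categorical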


Thanks for the reply.

From what I am seeing, it applies transforms automatically based on the scientific types, right? Is there no way to select columns manually?

Thanks for the input, I will check the package out.

In DataFrames.jl, you can select columns by element type, e.g.

julia> using DataFrames

julia> df = DataFrame(a = [1, 2], b = ["x", "y"], c = [1.5, 3.5]);

julia> df[:, names(df, String)]
2×1 DataFrame
 Row │ b
     │ String
─────┼────────
   1 │ x
   2 │ y

julia> df[:, names(df, Real)]
2×2 DataFrame
 Row │ a      c
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.5
   2 │     2      3.5

So you can subset columns as needed and operate on them. You can use the same strategy inside transform calls.
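For instance, a minimal sketch of that strategy (the mean-imputation helper and the toy table are made up for illustration):

julia> using DataFrames, Statistics

julia> dfm = DataFrame(a = [1, missing, 3], b = ["x", "y", "z"]);

julia> impute(col) = coalesce.(col, mean(skipmissing(col)));   # fill missings with the column mean

julia> transform(dfm, names(dfm, Union{Real, Missing}) .=> impute, renamecols = false)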

Thanks for the reply, but I was aiming for a method that could be included inside a pipeline, in order to create a more straightforward and consolidated way of treating data.
More like this:
https://alan-turing-institute.github.io/MLJ.jl/dev/linear_pipelines/
and the ColumnTransformer from scikit-learn in Python.

You can select/reject columns with Select and Reject and then apply specific transforms. Later you can join the results with a Parallel transform. The documentation explains this better.
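For example, a rough sketch along those lines (the transform and column names are illustrative; check the TableTransforms.jl docs for the exact API in your version):

using TableTransforms

# → chains transforms sequentially; ⊔ runs branches in parallel
# and joins their outputs into a single table
pipe = (Select("a", "c") → ZScore()) ⊔ (Select("b") → OneHot("b"))

# `table` is any Tables.jl-compatible table
newtable, cache = apply(pipe, table)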


ok thanks!

You can do some of what you want in MLJ. The built-in transformers are documented here. The FillImputer handles mixed types but is pretty basic. There is also MissingImputator (an MLJ model with core algorithm provided by BetaML.jl) for continuous data, which uses EM clustering. Probably not good for larger datasets. You might need to look at BetaML.jl to get detailed docs.
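A minimal sketch of FillImputer in use, assuming X is a table with missing values (per the MLJ docs, the defaults fill continuous columns with the median and finite columns with the mode):

using MLJ

imputer = FillImputer()
mach = machine(imputer, X) |> fit!
Ximputed = transform(mach, X)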

Unlike in TableTransforms.jl, feature selection in MLJ is specified as a transformer hyper-parameter. Currently MLJ has not adopted the split/apply/combine paradigm of TableTransforms.jl (although I think that's probably a good idea).
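For example (a sketch; :a and :c are placeholder column names):

using MLJ

# only the listed features are standardized; all other columns
# pass through the transformer unchanged
stand = Standardizer(features = [:a, :c])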


And the transformer OneHotEncoder now supports missing values.

Thanks for the reply, @ablaom .

From what I understood from the docs and your reply, MLJ has built-in transformers that handle mixed data types, but there is no way of applying certain functions to a certain data (or scientific) type through the pipeline (like TableTransforms.jl)?

Well, rather than split/apply/combine, you send your table to a model, and that model selectively operates on certain columns (for example, OneHotEncoder just spawns new columns for the Multiclass types) while the other columns are left untouched. The next model might selectively standardise all Continuous features, and so on. So, in principle, you should be able to carry out the same kinds of processing, but you are forced to compute in sequence, so performance may not be as good. I think the TableTransforms.jl approach is better.
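A minimal sketch of such a sequential pipeline, using MLJ's |> pipeline syntax (X stands for your feature table):

using MLJ

# each model in the chain acts only on columns of the relevant
# scientific type and passes the remaining columns through
pipe = FillImputer() |> OneHotEncoder() |> Standardizer()

mach = machine(pipe, X) |> fit!
Xprocessed = transform(mach, X)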

That said, if you build a composite model using MLJ's learning network syntax (instead of the "canned" linear pipeline syntax) then you have more flexibility. You can do split/apply/combine and a lot more. (For example, MLJ's model Stack functionality is implemented using learning networks. And there is a PR under review to make learning networks multithreaded.) But for routine pre-processing, this might be overkill.
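As a rough illustration, a split/apply/combine learning network might look like this (the toy table, column names, and the DataFrames-based glue node are all made up for the example):

using MLJ, DataFrames

X = DataFrame(a = [1.0, 2.0], b = ["x", "y"], c = [1.5, 3.5])
X = coerce(X, :b => Multiclass)

Xs = source(X)

# branch 1: standardize the continuous columns
cont  = node(x -> selectcols(x, [:a, :c]), Xs)
mach1 = machine(Standardizer(), cont)
W1    = transform(mach1, cont)

# branch 2: one-hot encode the categorical column
catcol = node(x -> selectcols(x, [:b]), Xs)
mach2  = machine(OneHotEncoder(), catcol)
W2     = transform(mach2, catcol)

# combine: glue the two processed branches back into one table
W = node((t1, t2) -> hcat(DataFrame(t1), DataFrame(t2)), W1, W2)

fit!(W)   # fits every machine in the network
W()       # returns the combined, processed table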


Thanks for the reply. Will try it out!
