Hi everyone, I have mixed type data and want to apply different functions to different types in a pipeline.
Like imputation with the mean for continuous columns and KNN imputation for categorical ones, and so on…
Is it possible with MLJ or some other package?
We are already using ScientificTypes.jl to decide whether or not it makes sense to apply certain transforms to continuous vs. categorical columns. We could easily provide a generic transform that knows which columns to select, but this is use-case-specific. Feel free to reach out to us in our Zulip machine-learning stream for further questions.
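For illustration, this is roughly the kind of check being described, using ScientificTypes.jl (the table and column names are just made up):

```julia
using ScientificTypes

# toy table; column names are made up for illustration
table = (age = [25.0, 30.0, 41.0], color = ["red", "blue", "red"])
table = coerce(table, :color => Multiclass)

schema(table)            # reports the scitype of each column
elscitype(table.age)     # Continuous => mean imputation makes sense
elscitype(table.color)   # Multiclass => it does not
```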
Thanks for the reply, but I was aiming for a method that I could include inside a pipeline in order to create a more straightforward and condensed way of treating data.
More like this: https://alan-turing-institute.github.io/MLJ.jl/dev/linear_pipelines/
and the ColumnTransformer from scikit-learn in Python.
You can select/reject columns with Select and Reject and then apply specific transforms to each branch. Afterwards you can join the results with a Parallel transform. The documentation explains this better.
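Roughly (from memory of the TableTransforms.jl API, so double-check the names against the docs), a split/apply/combine pipeline looks like this, where → chains transforms and ⊔ runs branches in parallel and merges the resulting columns:

```julia
using TableTransforms

# toy table; :a and :b are continuous, :c is categorical (names made up)
table = (a = rand(100), b = 10 .* rand(100), c = rand(["x", "y"], 100))

# one branch z-scores the continuous columns, the other just passes :c through;
# ⊔ (\sqcup) merges the columns produced by both branches
pipeline = (Select(:a, :b) → ZScore()) ⊔ Select(:c)

newtable, cache = apply(pipeline, table)
```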
You can do some of what you want in MLJ. The built-in transformers are documented here. The FillImputer handles mixed types but is pretty basic. There is also MissingImputator (an MLJ model with core algorithm provided by BetaML.jl) for continuous data, which uses EM clustering. Probably not good for larger datasets. You might need to look at BetaML.jl to get detailed docs.
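For the basic mixed-type case, a minimal FillImputer sketch (the defaults are, if I recall correctly, median for continuous and mode for finite/categorical columns; both can be changed via the model's hyper-parameters, and the column names below are made up):

```julia
using MLJ

# toy table with missings in a Continuous and a Multiclass column
X = (age   = [25.0, missing, 41.0, 37.0],
     color = coerce(["red", "blue", missing, "red"], Union{Missing,Multiclass}))

imputer = FillImputer()
mach = machine(imputer, X)
fit!(mach)
transform(mach, X)       # missings replaced column by column
```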
Unlike in TableTransforms.jl, feature selection is specified as a transformer hyper-parameter. Currently MLJ has not adopted the split/apply/combine paradigm of TableTransforms.jl (although I think that's probably a good idea).
From what I understood from the docs and your reply, MLJ has built-in transformers that handle mixed data types, but there is no way of applying certain functions to a certain data (or scientific) type through the pipeline (like TableTransforms.jl)?
Well, rather than split/apply/combine, you send your table to a model, and that model selectively operates on certain columns - for example, OneHotEncoder just spawns new columns for the Multiclass features - and the other columns are left untouched. The next model might selectively standardise all Continuous features, and so on. So, in principle, you should be able to carry out the same kinds of processing. But you are forced to compute in sequence, so performance may not be as good. I think the TableTransforms.jl approach is better.
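To make that concrete, here is a rough sketch of such a sequential (linear) pipeline; the toy column names are made up, and note that in this order the dummy columns produced by OneHotEncoder get standardised too:

```julia
using MLJ

# toy mixed-type table (column names made up)
X = (height = [1.85, 1.67, 1.72, 1.90],
     gender = coerce(["m", "f", "f", "m"], Multiclass))

# OneHotEncoder only touches Multiclass columns; Standardizer then only
# touches Continuous columns (which now include the new dummy columns)
pipe = OneHotEncoder() |> Standardizer()

mach = machine(pipe, X)
fit!(mach)
transform(mach, X)
```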
That said, if you build a composite model using MLJ's learning network syntax (instead of the "canned" linear pipeline syntax) then you have more flexibility. You can do split/apply/combine and a lot more. (For example, MLJ's model Stack functionality is implemented using learning networks. And there is a PR under review to make learning networks multithreaded.) But for routine pre-processing, this might be overkill.
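For example, here is a rough, untested split/apply/combine sketch using learning-network nodes; the column names and the merge step are purely illustrative:

```julia
using MLJ
import Tables

# toy mixed-type table (column names made up)
X = (age   = [25.0, 30.0, 41.0, 37.0],
     color = coerce(["red", "blue", "red", "green"], Multiclass))

Xs = source(X)

# "split": lift ordinary column selection into the network with `node`
Xcont = node(x -> MLJ.selectcols(x, [:age]), Xs)
Xcat  = node(x -> MLJ.selectcols(x, [:color]), Xs)

# "apply": a different transformer on each branch
Wcont = transform(machine(Standardizer(), Xcont), Xcont)
Wcat  = transform(machine(OneHotEncoder(), Xcat), Xcat)

# "combine": merge the processed branches back into a single column table
W = node((a, b) -> merge(Tables.columntable(a), Tables.columntable(b)), Wcont, Wcat)

fit!(W)   # trains every machine in the network
W()       # the fully processed table
```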