Transform several columns of an MLJ model using one transformer

ParadaCarleton · October 24, 2023, 7:17am

Is there an easy way to transform several columns of a table using the same MLJ transformer? So if I have a transformer like UnivariateBoxCoxTransformer, I’d like to apply it only to columns 2, 3, and 4.

Alternatively, some way to avoid applying a transformer (like Standardizer) to some columns.

juliohm · October 24, 2023, 8:09am

If you can’t find a solution with MLJ.jl take a look at TableTransforms.jl.

ParadaCarleton · October 24, 2023, 9:13pm

Thanks, but unfortunately I’m looking for an answer so I can make a PR to MLJ.jl, and I’d rather not add a dependency.

ablaom · October 24, 2023, 10:26pm

I’ve long thought we should have this, but we don’t. An issue was opened some time ago:

github.com/JuliaAI/MLJModels.jl

Universal table transformer combining univariate transformations dispatched on schema

opened 10:06PM - 04 Aug 20 UTC

ablaom

It has been proposed on Slack that it be possible to have a single table transformer that transforms individual columns according to **user-specified** univariate transformations. This sounds like a good idea, which would also force some uniformity that's a little bit lacking in the current collection of table transformers.

1. In the most general case I can imagine implementing, the univariate transformer that applies to a particular column is defined by a function that operates on both the `name` and `scitype` of the column (as encoded in the table `schema`). This has the disadvantage that the user must specify a function with two arguments - or interact through some other complicated interface.

2. The alternative would be a compositional approach. Each tabular transformer only carries out a *single* univariate transformer, applying to all specified `names` and `scitypes` (or "not"-names and "not"-scitypes, through an `ignore` Boolean parameter), which would cover all conceivable use-cases (columns not referred to are left alone). However, as we are currently locked into Tables.jl (which are non-mutable in general) we get a lot more copying of data.

Thoughts anyone?
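To make option 2 concrete, here is a minimal sketch in plain Julia, operating on a `NamedTuple` of vectors (one valid kind of column table). The helper name `transform_selected` and its keyword arguments are hypothetical and only illustrate the `names` / `ignore` idea; this is not an existing MLJ API.

```julia
# Hypothetical helper, not part of MLJ: apply one univariate function `f`
# elementwise to the columns listed in `names`. With `ignore=true` the
# selection is inverted, so `f` is applied to every column NOT listed.
function transform_selected(f, table::NamedTuple; names, ignore::Bool=false)
    cols = map(keys(table)) do name
        selected = (name in names) != ignore   # invert the selection when ignore=true
        selected ? f.(table[name]) : table[name]
    end
    return NamedTuple{keys(table)}(cols)
end

table = (a = [1.0, 10.0, 100.0], b = [2.0, 4.0, 8.0], c = ["x", "y", "z"])

# Transform only column :a.
logged = transform_selected(log10, table; names = (:a,))

# Transform everything except column :c.
doubled = transform_selected(x -> 2x, table; names = (:c,), ignore = true)
```

In this sketch the untouched columns are reused rather than copied, so only the transformed columns allocate new vectors. Whether that is acceptable for a general Tables.jl source is a separate design question.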

math4mad · October 25, 2023, 12:40am

Please! Right now it is very cumbersome, like this:


# 1. Load packages
using DataFrames, TableTransforms
using Random
Random.seed!(34343)

# 2. Load data into a DataFrame
#    (load_csv is not defined in this snippet; presumably the poster's own helper)
df = load_csv("BostonHousing")

# 3. Transformation: one function per column
table = eachcol(df) |> Functional(
    1 => log,
    2 => x -> x / 10,
    3 => log,
    4 => x -> x,
    5 => log,
    6 => log,
    7 => x -> (x^2.5) / 10000,
    8 => log,
    9 => log,
    10 => log,
    11 => x -> exp(0.4 * x) / 1000,
    12 => x -> x / 100,
    13 => x -> sqrt(x),
    14 => log
)
first(table, 10)

It would be better if columns [1,3,5,6,8,9,10,14], which all use log, could be written in one line.
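One possible way to shorten this with plain Julia is to build the repeated pairs with a comprehension and splat them. This is only a sketch: the names `log_cols`, `log_pairs` and `other_pairs` are made up for illustration, and the final (commented-out) call assumes `Functional` accepts per-column pairs exactly as in the snippet above.

```julia
# Columns that all take the same function, written once.
log_cols = [1, 3, 5, 6, 8, 9, 10, 14]
log_pairs = [i => log for i in log_cols]

# Remaining columns keep their individual functions.
other_pairs = [
    2 => (x -> x / 10),
    4 => identity,
    7 => (x -> x^2.5 / 10000),
    11 => (x -> exp(0.4x) / 1000),
    12 => (x -> x / 100),
    13 => sqrt,
]

# Assumption: `Functional` takes `col => fun` pairs as in the post above,
# so both lists could be splatted into a single call:
# table = eachcol(df) |> Functional(log_pairs..., other_pairs...)
```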

ParadaCarleton · October 25, 2023, 3:15am

Damn. So TIL about Functional, which is a huge quality-of-life improvement already. Actually, as I’m reading more about TableTransforms.jl, I’m thinking this is a great consolidation opportunity; most of the built-in transformations in MLJModels.jl might fit better in TableTransforms.

juliohm · October 25, 2023, 10:29am

TableTransforms.jl has very flexible column selection features:

Functional([1,3,5,6,8,9,10,14] => log)

You can use lists of symbols, strings, integers, regex, …

ParadaCarleton · October 25, 2023, 4:57pm

Ooh, neat! Perhaps this could be more clearly documented?

juliohm · October 25, 2023, 5:04pm

All transforms should have a clear docstring explaining these options.

math4mad · October 27, 2023, 6:32am

This method doesn’t work right now.

juliohm · October 27, 2023, 11:22am

You need to be more explicit about which method doesn’t work. Can you please share a MWE?

ParadaCarleton · October 29, 2023, 5:04pm

Is there a way to use TableTransforms.jl together with MLJ transforms?

ablaom · October 30, 2023, 2:32am

Not currently. The main issues are:

1. (abstract type roadblock) MLJModelInterface requires new algorithms to subtype an abstract type owned by MLJModelInterface (Unsupervised or Static) but TableTransforms.jl, as I understand it, is trying for a pure functional interface, and without depending on externally owned types.
2. (limitation on functionality) The MLJTuning.jl API for tuning models is based on mutation of the hyperparameter struct, and so not suited to TableTransforms.jl transformer structs, which are immutable. This currently rules out optimization of transformer hyperparameters in MLJ pipelines.

One day MLJ may rid itself of its abstract model type hierarchy (for efforts in this direction, see this announcement). However, it is substantially embedded in the ecosystem and unlikely to disappear in the near future.

A simple, but unattractive, solution to 1. would be for MLJModels.jl or TableTransforms.jl to provide a wrapper. The only way I can think of to avoid the wrapper in the status quo would require metaprogramming hacks that would likely be brittle.
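For illustration, a wrapper of this kind might look roughly like the sketch below. It assumes MLJModelInterface's `Static` abstract type and its `transform(model, fitresult, X)` method for static transformers. The type name `TableCallable` and its field `f` are hypothetical; nothing like this is currently provided by MLJModels.jl or TableTransforms.jl. The wrapped object here is an ordinary function standing in for any table-to-table callable.

```julia
import MLJModelInterface as MMI   # `import ... as` needs Julia 1.6 or later

# Hypothetical wrapper: holds any callable that maps a table to a table.
# The field is deliberately untyped so the callable can be swapped by mutation.
mutable struct TableCallable <: MMI.Static
    f
end

# Static models learn nothing from data, so the fitresult argument is ignored.
MMI.transform(model::TableCallable, _, X) = model.f(X)

# Stand-in for a table transform: halve every column of a NamedTuple table.
halve(table) = map(col -> col ./ 2, table)

model = TableCallable(halve)
Xt = MMI.transform(model, nothing, (a = [2.0, 4.0], b = [6.0, 8.0]))
```

This only addresses point 1. Hyperparameters stored inside an immutable wrapped transformer would still not be reachable by mutation, so point 2 remains open in this sketch.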

ParadaCarleton · November 2, 2023, 1:02am

Sounds good to me!
