[ANN] TableTransforms.jl

juliohm · October 28, 2021, 5:26pm

TableTransforms.jl is a new package for transforms and pipelines commonly used in statistics and machine learning. It works with any Tables.jl table and has some unique features compared to previous attempts.

The package has been submitted for registration and should be available soon as a dependency for other packages. We invite the community to contribute more tests.

datnamer · October 28, 2021, 5:27pm

Can this integrate with MLJ?

juliohm · October 28, 2021, 5:35pm

In my opinion, the question should be: Can MLJ consume TableTransforms.jl?

We are writing a package for transforms with tabular data that is self-contained and has a clean and well-defined API. That way many Julia users can contribute new transforms with ease.

MLJ is a large project with many complex interconnections. If they think TableTransforms.jl should be a dependency, we will be happy to help.

oxinabox · October 28, 2021, 5:41pm

How does it compare to https://github.com/invenia/FeatureTransforms.jl ?

juliohm · October 28, 2021, 5:43pm

I explained it in the README @oxinabox. I started TableTransforms.jl because of limitations in the current design of FeatureTransforms.jl. Basically, we can revert arbitrarily complex pipelines and exploit multiple threads via the awesome Transducers.jl.

BTW, I tried to contribute to FeatureTransforms.jl before starting the new approach. We realized that a fresh start was really needed to improve the status quo without breaking people’s code.

Also, we do not support arrays by design. This is to make sure that the code doesn’t get messy with keyword arguments. Finally, our transforms are very cheap structs without any reference to the data. So we can create pipelines completely detached from a source.

oxinabox · October 28, 2021, 5:51pm

Right.
That would be a disadvantage of the fact that we run FeatureTransforms.jl in a lot of production code.
Breaking changes require a lot of coordination (otherwise you end up having to maintain a backports branch forever).

CameronBieganek · October 28, 2021, 5:56pm

Off the top of my head, I can’t think of a reason why I would need to invert a feature transformation pipeline. Target transformations need to be invertible, but I can’t think of a reason why feature transformations need to be invertible. Data normally flows through the pipeline in only one direction. Can you give an example where the invertibility of feature transformations comes in handy?

juliohm · October 28, 2021, 6:00pm

In geostatistical modeling, for example, we need to run the pipeline forward to get clean, pretty, uncorrelated Gaussian variables, do some additional modeling regarding geospatial correlation, and then revert the estimates. This is a pretty standard workflow in this field.

I can imagine other situations where users are interested in doing analysis in PCA space and then coming back to the original ranges to show results and generate insight.

datnamer · October 28, 2021, 6:04pm

Agreed, and that’s what I meant

CameronBieganek · October 28, 2021, 8:48pm

Some bikeshedding: How about invert and isinvertible instead of revert and isrevertible?

Also, what’s the difference between running

newtable, cache = apply(pipeline, oldtable)
original = revert(pipeline, newtable, cache)

as opposed to just keeping oldtable around if you need it? As far as I can tell, original == oldtable.

CameronBieganek · October 28, 2021, 8:55pm

Thinking about it some more… I don’t see a way in your package to specify which columns you want to transform, so in order to make a true pipeline you would need to add a Select() transformer, so you could do something like this:

Select(:a) → ZScore()

But Select(:a) is not invertible, so there goes all your invertibility out the window.

juliohm · October 28, 2021, 8:57pm

Initially I had used isinvertible, but then I realized that the concept we want here is revertibility and not invertibility. We want to go forward and backward in a pipeline, and the concept of inverse is slightly different. Some of these transforms are revertible but not invertible.

Regarding the cache choices, some transforms like Center and ZScore only need to keep track of mu and sigma. To save memory in pipelines, we cache only the minimum amount of information necessary to revert the transform. Sequential transforms constructed with → (typed as \to), for example, have a cache that is a sequence of caches.
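A minimal sketch of this caching idea in plain Julia, working on a single column. The names ToyCenter, ToyScale, toyapply and toyrevert are hypothetical and are not the TableTransforms.jl implementation; the point is only that the cache holds two numbers rather than a copy of the data, and that a sequential pipeline carries one cache per step.

```julia
using Statistics

# Hypothetical toy transforms; these are NOT the TableTransforms.jl types.
struct ToyCenter end
struct ToyScale end

# Each apply returns the transformed column plus the smallest cache needed to undo it.
toyapply(::ToyCenter, x) = (x .- mean(x), mean(x))
toyapply(::ToyScale, x) = (x ./ std(x), std(x))

toyrevert(::ToyCenter, y, mu) = y .+ mu
toyrevert(::ToyScale, y, sigma) = y .* sigma

# A sequential pipeline is a tuple of transforms; its cache is the list of step caches.
function toyapply(pipeline::Tuple, x)
    caches = []
    for t in pipeline
        x, c = toyapply(t, x)
        push!(caches, c)
    end
    return x, caches
end

# Revert walks the steps backwards, pairing each transform with its own cache.
function toyrevert(pipeline::Tuple, y, caches)
    for (t, c) in reverse(collect(zip(pipeline, caches)))
        y = toyrevert(t, y, c)
    end
    return y
end

x = [1.0, 2.0, 3.0, 4.0]
pipeline = (ToyCenter(), ToyScale())
z, caches = toyapply(pipeline, x)     # caches holds the mean and the standard deviation
xhat = toyrevert(pipeline, z, caches) # recovers x up to floating-point error
```

Here the cache is two scalars regardless of how many rows the column has, which is the difference from simply keeping the old table around.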

Regarding Select, that would be a nice addition. We had other names in mind though, like RowView, ColView and View. Select is not invertible, but we can save the other columns and restore them later, so it is revertible.
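One way to picture a revertible column selection, as a sketch only: the table is assumed to be a NamedTuple of vectors, and toyselect and toyunselect are hypothetical helpers, not the package's API. The cache stores the dropped columns and the original column order.

```julia
# Hypothetical helpers on a column table represented as a NamedTuple of vectors.
function toyselect(table::NamedTuple, cols::Symbol...)
    selected = NamedTuple{cols}(table)
    # Cache: the original column order plus the columns that were dropped.
    rest = filter(n -> !(n in cols), keys(table))
    dropped = NamedTuple{rest}(table)
    return selected, (order = keys(table), dropped = dropped)
end

function toyunselect(selected::NamedTuple, cache)
    # Put the dropped columns back, then restore the original column order.
    merged = merge(selected, cache.dropped)
    return NamedTuple{cache.order}(merged)
end

table = (a = [1, 2, 3], b = [4.0, 5.0, 6.0], c = ["x", "y", "z"])
sub, cache = toyselect(table, :a, :c)
restored = toyunselect(sub, cache)
```

The selection alone loses column b, so it has no inverse as a function, but with the cache the original table, including its column order, can be rebuilt.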

CameronBieganek · October 28, 2021, 9:03pm

Interesting. Which transformations are revertible but not invertible?

juliohm · October 28, 2021, 9:07pm

For example, Parallel transforms take a single table as input, run multiple transforms in parallel and concatenate the columns. It is revertible because one can pick any of the transforms that is revertible and recover the input table, but it is not invertible because there may be multiple paths to revert, each producing a slightly different input table. For example, PCA reconstruction may not be perfect sometimes.
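A toy illustration of "revert through any revertible branch", assuming a single column as input. ToyShift, ToyAbs, toyparallel and toyrevertparallel are hypothetical names used only for this sketch and do not reflect how the package implements parallel transforms.

```julia
using Statistics

struct ToyShift end   # revertible: caches the mean
struct ToyAbs end     # not revertible: the sign is lost

toyisrevertible(::ToyShift) = true
toyisrevertible(::ToyAbs) = false

toyapply(::ToyShift, x) = (x .- mean(x), mean(x))
toyapply(::ToyAbs, x) = (abs.(x), nothing)
toyrevert(::ToyShift, y, mu) = y .+ mu

# Run every branch on the same input and keep one output and one cache per branch.
function toyparallel(branches::Tuple, x)
    results = map(t -> toyapply(t, x), branches)
    return map(first, results), map(last, results)
end

# Revert through the first branch that reports itself as revertible.
function toyrevertparallel(branches::Tuple, outputs, caches)
    for (i, t) in enumerate(branches)
        if toyisrevertible(t)
            return toyrevert(t, outputs[i], caches[i])
        end
    end
    error("no revertible branch")
end

x = [-1.0, 2.0, 5.0]
branches = (ToyAbs(), ToyShift())
outputs, caches = toyparallel(branches, x)
recovered = toyrevertparallel(branches, outputs, caches)
```

With several revertible branches, each one could return a slightly different reconstruction, which is why the combined transform is described as revertible rather than invertible.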

juliohm · October 29, 2021, 4:20am

@CameronBieganek I think we can actually provide a revertible Select, we just need to save the other columns and restore later. I will try to add this transform in the following days together with a Discard. Just need to decide what is the most appropriate name for these transforms.

juliohm · October 29, 2021, 2:13pm

Select/Reject transforms added. Adding tests now to make sure that order is preserved in the revert step.

bgctw · February 11, 2022, 7:35pm

Where can I find documentation in addition to the README, for example on the several available transforms?

Do I need to resort to general ML documentation or to the documentation of FeatureTransforms.jl?

juliohm · February 11, 2022, 7:41pm

Did you try the docstrings of each transform? For example, ?PCA. Type a question mark followed by the name of the transform you are interested in.

bgctw · February 11, 2022, 7:50pm

?Quantile gives me “The quantile transform to a given distribution.”

I am not an ML person, but I am starting to explore how I can transform parameters before a Bayesian inversion. I thought that TableTransforms might be a more general alternative to TransformVariables.jl, but the current documentation does not allow me to evaluate this.

juliohm · February 11, 2022, 8:05pm

Yes, unfortunately the docstrings aren’t ideal. The Quantile transform converts the CDF of the input to any given CDF using inverse sampling. I think the closest Wikipedia page is

The idea is that you can convert from CDF1 to a uniform CDF and then to CDF2. So your transform object Quantile(Normal()) will convert the original CDF to a Normal CDF. You can try any continuous distribution from Distributions.jl as the argument to the Quantile transform.
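The two steps can be sketched on a single column as follows. This is an illustration under assumptions, not the package's implementation: toyquantile is a hypothetical helper, the input CDF is approximated by ranks, ties are not handled, and Distributions.jl is assumed to be installed.

```julia
using Distributions

# Hypothetical helper, not the TableTransforms.jl implementation.
function toyquantile(x::AbstractVector, target)
    n = length(x)
    # Step 1: empirical CDF of the input via ranks, kept strictly inside (0, 1).
    ranks = invperm(sortperm(x))
    u = (ranks .- 0.5) ./ n
    # Step 2: push the uniform scores through the inverse CDF of the target.
    return [quantile(target, ui) for ui in u]
end

x = [10.0, 0.1, 3.0, 250.0, 1.5]   # skewed sample
z = toyquantile(x, Normal())       # same ordering as x, roughly Normal scores
```

Both steps are monotone, so the ordering of the values is preserved while their distribution is reshaped toward the target.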
