[ANN] TableTransforms.jl

TableTransforms.jl is a new package for transforms and pipelines commonly used in statistics and machine learning. It works with any table that implements the Tables.jl interface and has some unique features compared to previous attempts:

The package has been submitted for registration and should soon be available as a dependency for other packages. We invite the community to contribute more tests.

Can this integrate with MLJ?

In my opinion, the question should be: Can MLJ consume TableTransforms.jl?

We are writing a package for transforms with tabular data that is self-contained and has a clean and well-defined API. That way many Julia users can contribute new transforms with ease.

MLJ is a large project with many complex interconnections. If they think TableTransforms.jl should be a dependency, we will be happy to help.

How does it compare to FeatureTransforms.jl (invenia/FeatureTransforms.jl on GitHub: transformations for performing feature engineering in machine learning applications)?

I explained it in the README, @oxinabox. I started TableTransforms.jl because of limitations in the current design of FeatureTransforms.jl. Basically we can revert arbitrarily complex pipelines and exploit multiple threads via the awesome Transducers.jl :heart:

BTW, I tried to contribute to FeatureTransforms.jl before starting the new approach. We realized that a fresh start was really needed to improve the status quo without breaking people’s code.

Also, we do not support arrays by design. This is to make sure that the code doesn’t get messy with keyword arguments. Finally, our transforms are very cheap structs without any reference to the data. So we can create pipelines completely detached from a source.

Right.
That is a downside of running FeatureTransforms.jl in a lot of production code.
Breaking changes require a lot of coordination (otherwise you end up having to maintain a backports branch forever).

Off the top of my head, I can’t think of a reason why I would need to invert a feature transformation pipeline. Target transformations need to be invertible, but I can’t think of a reason why feature transformations need to be invertible. Data normally flows through the pipeline in only one direction. Can you give an example where the invertibility of feature transformations comes in handy?

In geostatistical modeling, for example, we need to run the pipeline forward to get clean, uncorrelated Gaussian variables, do some additional modeling of the geospatial correlation, and then revert the estimates. This is a pretty standard workflow in this field.

I can imagine other situations where users are interested in doing analysis in PCA space and then coming back to the original ranges to show results and generate insight.

Agreed, and that’s what I meant

Some bikeshedding: How about invert and isinvertible instead of revert and isrevertible?

Also, what’s the difference between running

newtable, cache = apply(pipeline, oldtable)
original = revert(pipeline, newtable, cache)

as opposed to just keeping oldtable around if you need it? As far as I can tell, original == oldtable.

Thinking about it some more… I don’t see a way in your package to specify which columns you want to transform, so in order to make a true pipeline you would need to add a Select() transformer, so you could do something like this:

Select(:a) → ZScore()

But Select(:a) is not invertible, so there goes all your invertibility out the window.

Initially I had used isinvertible, but then I realized that the concept we want here is revertibility and not invertibility. We want to go forward and backward in a pipeline, and the concept of inverse is slightly different. Some of these transforms are revertible but not invertible.

Regarding the cache choices, some transforms like Center and ZScore only need to keep track of mu and sigma. To save memory in pipelines, we cache only the minimum amount of information necessary to revert the transform. Sequential transforms constructed with → (typed as \to followed by Tab in the Julia REPL), for example, have a cache that is a sequence of caches.
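To make the cache idea concrete, here is a minimal sketch of a z-score with a mu/sigma cache. It is not the TableTransforms.jl implementation: `zscore_apply` and `zscore_revert` are hypothetical helper names, and the table is assumed to be a plain NamedTuple of vectors (which is a valid Tables.jl column table).

```julia
using Statistics

# Hypothetical helpers for illustration only, NOT the TableTransforms.jl API.
# The table is assumed to be a NamedTuple of numeric vectors.
function zscore_apply(table::NamedTuple)
    μ = map(mean, table)   # per-column means, same column names as the table
    σ = map(std, table)    # per-column standard deviations
    newtable = map((x, m, s) -> (x .- m) ./ s, table, μ, σ)
    # The cache holds two numbers per column and no reference to the data.
    return newtable, (μ = μ, σ = σ)
end

function zscore_revert(newtable::NamedTuple, cache)
    # Undo the standardization using only the cached statistics.
    return map((z, m, s) -> z .* s .+ m, newtable, cache.μ, cache.σ)
end

table = (a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 40.0])
newtable, cache = zscore_apply(table)
original = zscore_revert(newtable, cache)
```

The point of the sketch is that the cache grows with the number of columns, not with the number of rows, which is why it is cheaper than keeping the old table around.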

Regarding Select, that would be a nice addition. We had other names in mind though, like RowView, ColView and View. Select is not invertible, but we can save the other columns and restore them later, so it is revertible.

Interesting. Which transformations are revertible but not invertible?

For example, Parallel transforms take a single table as input, run multiple transforms in parallel and concatenate the columns. It is revertible because one can pick any of the transforms that is revertible and recover the input table, but it is not invertible because there may be multiple paths to revert, each producing a slightly different input table. For example, PCA reconstruction may not be perfect sometimes.

@CameronBieganek I think we can actually provide a revertible Select: we just need to save the other columns and restore them later. I will try to add this transform in the next few days, together with a Discard. We just need to decide on the most appropriate names for these transforms.
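The "save the other columns and restore later" idea can be sketched as follows. This is only an illustration under stated assumptions, not the code that ended up in the package: `select_apply` and `select_revert` are hypothetical names, and the table is assumed to be a NamedTuple of vectors.

```julia
# Hypothetical helpers for illustration only, NOT the TableTransforms.jl API.
# The table is assumed to be a NamedTuple of vectors.
function select_apply(table::NamedTuple, cols::Tuple{Vararg{Symbol}})
    colnames = keys(table)                        # original column order
    rejected = filter(n -> n ∉ cols, colnames)    # columns that are dropped
    newtable = NamedTuple{cols}(table)            # keep only the selected columns
    # Cache the dropped columns and the original order for the revert step.
    cache = (colnames = colnames, rejected = NamedTuple{rejected}(table))
    return newtable, cache
end

function select_revert(newtable::NamedTuple, cache)
    merged = merge(newtable, cache.rejected)      # put the dropped columns back
    return NamedTuple{cache.colnames}(merged)     # restore the original order
end

table = (a = [1, 2], b = [3, 4], c = [5, 6])
newtable, cache = select_apply(table, (:c, :a))
restored = select_revert(newtable, cache)
```

Unlike the z-score case, the cache here does hold data (the rejected columns), which is the price of making a non-invertible transform revertible.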

Select/Reject transforms added. Adding tests now to make sure that order is preserved in the revert step.

Where can I find documentation in addition to the README, e.g. on the several available transforms?

Do I need to resort to general ML documentation or to the documentation of FeatureTransforms.jl?

Did you try the docstrings of each transform? For example, ?PCA. Type a question mark followed by the name of the transform you are interested in.

?Quantile gives me “The quantile transform to a given distribution.”

I am not an ML person, but I am starting to explore how I can transform parameters before a Bayesian inversion. I thought that TableTransforms might be a more general alternative to TransformVariables.jl, but the current documentation does not allow me to evaluate this.

Yes, unfortunately the docstrings aren’t ideal. The Quantile transform is a transform that converts the CDF of the input to any given CDF using inverse sampling. I think the closest wikipedia page is

The idea is that you can convert from CDF1 to a uniform CDF and then to CDF2. So your transform object Quantile(Normal()) will convert the original CDF to a Normal CDF. You can try any continuous distribution from Distributions.jl as the argument to the Quantile transform.
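A rough sketch of the two steps (data to uniform, uniform to target), using only Base Julia. This is not how the package implements Quantile: `quantile_transform` and `exponential_quantile` are hypothetical names, the empirical CDF is approximated by simple ranks with ties broken arbitrarily, and an exponential target is used only because its quantile function has a closed form. With Distributions.jl one could pass `u -> quantile(Normal(), u)` as the second argument instead.

```julia
# Hypothetical helper for illustration only, NOT the TableTransforms.jl implementation.
function quantile_transform(x::AbstractVector{<:Real}, target_quantile)
    n = length(x)
    ranks = invperm(sortperm(x))    # rank of each value, from 1 to n
    u = (ranks .- 0.5) ./ n         # step 1: empirical CDF values strictly inside (0, 1)
    return target_quantile.(u)      # step 2: inverse CDF of the target distribution
end

# Quantile function of an exponential distribution with rate 1 (assumed target).
exponential_quantile(u) = -log(1 - u)

x = [3.0, 1.0, 2.0]
y = quantile_transform(x, exponential_quantile)
```

The transform is monotone, so the ordering of the values is preserved while their distribution is reshaped toward the target.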
