TransformVariables.jl vs Bijectors.jl

I’ve been using @Tamas_Papp’s TransformVariables.jl for quite a while in my probabilistic programming work, and I’m very happy with it.

Not long ago (weeks? months? one of those) I learned of Bijectors.jl, which looks mostly to be the work of @mohamed82008, @yebai, and @torfjelde.

These packages seem to have identical purpose, with some differences in capabilities. TrasnformVariables has great support for named tuples, for example, while Bijectors seems unique in its coverage of transforms for some distributions, and also has some really nice support for normalizing flows.

I would love for all of this to be available in a single package. Could there be a path toward this?

If you are missing something in TransformVariables.jl, I am happy to take PRs (most of the time it is quite straightforward to code your own transformations, eg I coded the existing ones from the Stan manual; then you can test the Jacobian with AD).

Since the predecessor of TransformVariables.jl (ContinuousTransformations.jl) was around for a while when Bijectors.jl was started, I am assuming that the TuringLang contributors had good reasons to start their own package, but I don’t know what these were.

1 Like

I cannot speak for the reasons why Bijectors.jll was started initially as I was not part of the team then (nor do I know what the other options were at the time). I did consider TransformVariables.jl, which is a really nice package, prior to my work on Bijectors.jl and it seems to be perfect for transforming constrained-to-unconstrained distributions and vice-versa.

But such transforms are really nothing more than functions with some additional structure e.g. invertible, and so I found myself really wanting a package where this is how it felt; you call a bijector as you would a function, you invert a bijector as you would a real number, you compose bijectors as you would functions, etc. As long as you write type-stable code, you also get a lot of neat possibilities when it comes to transformations at compile-time, e.g. b ∘ inv(b) becomes identity. Then the distribution-related functionality is kind of “separate”, where a TransformedDistribution is simply the push-forward of some distribution d induced by a bijector b. This way you can do all kinds of nonsense with the bijectors themselves and then use the resulting bijector push forward any support distribution, e.g. normalizing flows, use in ADVI.

In conclusion, there are certain areas where Bijectors.jl and TransformVariables.jl overlap. For applying a “non-composed” transformation to a named tuple, I think TransformVariables.jl is really nice. If you’re doing complex stuff with the bijectors themselves, especially compositions, I think Bijectors.jl is the way to go.

As for combining efforts, technically (as far as I can tell) Bijectors.jl can do what TransformVariables.jl can do but we don’t have the nice interface with named tuples. Though it should be possible to implement such an interface for Bijectors.jl; we already have a similar thing for stacking together scalar- and vector-transformations in Stacked <: Bijector which uses index-ranges rather than names to decide where to apply which transform.

Unfortunately, I think moving Bijectors.jl to TransformVariables.jl would be difficult due to the significant difference in the underlying design (which I personally like quite a lot :/).

4 Likes

Thanks Tor! I wonder to what extent these can be used together, say with TransformVariables for the named tuple interface and Bijectors for the flows, etc. I’d guess the weirdest part of that could be getting the log-jacobians wired together right.

1 Like

I think the easiest approach would be to just wrap and unwrap methods around Bijectors.jl. Take each of the elements in the named tuple, apply the corresponding transforms separately, and then wrap it into a named tuple again. This way you don’t have to go through the pain of somehow making the compositions work nicely.

A since we’re working with named tuples, you could do this using @generated for maximal efficiency:)

1 Like

You’d still have to manage the Jacobians though, right? Since the two packages have different mechanisms for tracking this.

Makes me wonder about constructors for wrapping a bijector in a transform or vice versa

Indeed, calculating the log Jacobian is the reason why what TransformVariables calls “aggregators” (transform to arrays, tuples, named tuples) are considered transformations.

I have been thinking of decoupling these two things, but I have yet to come up with an interface that is actually an improvement over the current one:

1 Like

@torfjelde : Sorry to dig out a relatively old post - but congratulations to this package! Its really nice and useful!

One question that immediately came up for me is that the package’s mapping functions keep the same dimensionality when mapping from a constrained to an unconstrained space, even if you could potentially decrease dimensions, for instance when you map a Simplex or a Covariance matrix. Example:

prior_dist = Dirichlet([1,2,3,4]) # \in R^4 
bi = Bijectors.transformed(prior_dist)
rand(bi) # \in R^4 with last element fixed to 0

Is there a specific reason for that (I know the module is called Bijectors but still)? I figured a big reason for that package is the support for Turing, and decreasing dimensionality should be good for their supported samplers. If there is a big model where many such parameter are mapped then one could have a substantially lower unconstrained space.

Best regards,

Edit:
I went through the STAN manual who perform said transformation in their sampler, but its not so easy to actually get the dimension right when you calculate gradients wrt the unconstrained space and vice versa. I suppose keeping the dimensions the same and constrain them to zero makes coding quite a bit more comfortable/saves you the cost of getting Jacobians wrt unconstrained parameter at the cost of a larger dimension to sample from.

@mrVeng ; Sorry about really late reply. Haven’t really been active on discourse over the past week (though this I will be more so in the future!). And thank you!

This is indeed because we’re assuming those to be bijectors and thus they need them to be one-to-one mappings. Unfortunately, for the Dirichlet distribution there are two possible approaches, one of which isn’t actually one-to-one. Note that you can create a SimplexBijector(false) to get a bijector for Dirichlet which is actually one-to-one. In Turing, for samplers such as HMC, one should be using false here to ensure the transform is indeed bijective. (In fact, I would say this should be default in Bijectors.jl rather than the true one, so thanks for bringing this to my attention!) For a quick description of the two and when you’d like to use either, you can see https://pytorch.org/docs/stable/distributions.html#torch.distributions.transforms.SoftmaxTransform (hopefully we’ll have equally good docs for Bijectors.jl at some point:) )

In general though, being invertible is a requirement for a Bijector otherwise there’s not much extra structure compared to a regular function (and multiplication with the absolute value of the determinant of the jacobian of the transform doesn’t give you a new distribution).

Despite this we could still allow different output-types; you can have bijections between different spaces. The question is just whether or not we’d like to support that, as it will potentially require a lot more effort from the user to implement a Bijector. We could do something like Bijector{In, Out} and then you define Exp{Real, Real} or something. But this would be quite annoying to have to specify whenever you want to instantiate a Bijector. In the immediate future we’re probably going to add a dimensionality of the input and output to Bijector (assumed to be the same) due to ambiguity-issues when doing batch-computation (see https://github.com/TuringLang/Bijectors.jl/issues/35). In most cases this means no changes, since most bijectors are well-defined only for a particular dimensionality of the input, e.g. SimplexBijector. But there are cases where it’s necessary to specify, e.g. Log.

I hope this answers your question. I’d also suggest bringing this discussion over to the issue I mentioned if you have any further questions:)

1 Like

I was thinking more like just do all transformations in Bijectors.jl, and then you “unwrap” the result afterwards, i.e. replicate the interface in TransformVariables.jl but only using Bijectors.jl to do the computation. For example you could define a composition of two TransformedVariablesWrapper to be a TransformedVariablesWrapper of a Bijectors.Composed, etc.

But yeah, that wouldn’t be any interop, it would just be a replication of the TransformVariables.jl :confused:

1 Like

@mrVeng So my previous description of the SimplexBijector contains a bit of misinformation. I relied on my memory rather than looking at our actual implementation :confused: I had a proper look at our implementation again and consulted with the awesome @mohamed82008.

The issue with the transformation for the Dirichlet distribution, i.e. the SimplexBijector, is that we’re really talking about a K - 1 dimensional space rather than a K dimensional space since the last component of a sample from a Dirichlet is fully defined by the first K - 1 components. So if you want to apply the inverse transformation to y, where y = b(x) for some x in the support of the Dirichlet, it doesn’t matter whether or not you use the inverse of SimplexBijector(true) or SimplexBijector(false); they’ll both give you the same result. But if y is actually something that was not originally on the simplex, when using false you’ll get something which doesn’t sum to 1. Now, all this also means that the jacobian of the transformation is singular / non-invertible which kind of breaks the assumption on a Bijector. So the logabsdetjac of SimplexBijector(false) in this case really returns the logabsdetjac without including the last component (which is 0). This is fine for defining the TransformedDistribution since transforming Dirichlet is really transforming a multivariate distribution over K - 1 variables rather than over K variables despite being represented as K dimensional.

TL;DR: SimplexBijector is a bijection on a K - 1 (sub-)space (of ℝᴷ) despite being represented by a vector of length K. So for the output of the inverse-map to be compatible with logpdf of a Dirichlet, we need don’t want to reduce the number of dimensions.

Hope this clarifies!

Bijectors.jl can do what TransformVariables.jl can do but we don’t have the nice interface with named tuples.

@torfjelde, if you’re interested: I wrote ValueShapes.jl to do just that (mapping between flat vectors and structured representations of parameters, without transformation of values). Maybe Bijectors.jl could make use of it? If you think that would be possible, in principle, but you’re missing some features, please let me know.

I find it a very powerful approach to use the Lens API from Setfield.jl for accessing fields of nested objects. It’s very elegant and composable. For example, Kaleido.jl is my take on extending it. It supports, e.g., getting/setting a subset of fields via a tuple and adding constraints between fields.

Due to the origin of lens in functional programming, Setfield.jl hasn’t supported in-place mutation for a while. But there was a discussion recently on extending it to in-place mutation or mutate-or-widen approach. As a result, there is now a customization point which can be used to add in-place variants.

2 Likes

@torfjelde: Thank you a lot for your detailed answers! I have to say I am quite a bit confused now, so maybe I can try to confirm if I got it roughly right: for the Simplex, you actually have 2 implementations:

  1. SimplexBijector{Val{true}}() #default transformer - a softmax transform which is non-bijective. Here, the last value is mapped to 0 and has no impact on the transformed pdf (and used for dimension matching) - I just checked that via:
using Distributions
using Bijectors

distr = Dirichlet(3,2)
distr_trans = transformed(distr, Bijectors.SimplexBijector{Val{true}}() )
bij = bijector(distr)
inv_bij = inv(bij)

logpdf_with_trans(distr_trans, inv_bij([1.,2,3]), true) == logpdf_with_trans(distr_trans, inv_bij([1.,2,1000]), true) #true

Similar holds for the gradient calculation of the transformed density (which in this case makes no sense then). However, the transform that people who want to use the Bijectors module with MCMC really would like is the bijective one:

  1. SimplexBijector{Val{false}}() #This should be the one that regards the Dirichlet as stick-breaking process, right?

We can view the Dirichlet as Stick-Breaking Process, where the first K-1 pieces describe the final piece. I have seen that both SimplexBijector transform Dirichlet parameter to the unconstrained space equivalently, adding a 0 to the Kth element - so I am more interested in the inverse transform. Here, lets start from the unconstrained space and map back to the original parameter space. One could reverse engineer the mapping via your module by first sampling K-1 parameter, and then add the mysterious 0 to the end (kind of the starting point for my question). Then one can use the inverse of the SimplexBijector{Val{false}} bijector:

Check:

#Simplex bijector
using Distributions
using Bijectors
using ForwardDiff

distr = Dirichlet(3,2)
distr_trans2 = transformed(distr, Bijectors.SimplexBijector{Val{false}}() )
bij2 = Bijectors.SimplexBijector{Val{false}}()
inv_bij2 = inv(bij2)

#logpdf dependent on all parameter
logpdf(distr_trans2, inv_bij2([1.,2,0]) ) == logpdf(distr_trans2, inv_bij2([1.,2,300]) ) #FALSE

# Sample K-1 Parameter, add 0 to end and map back
θ_transformed = push!(randn(2), 0)
θ_constrained = inv_bij2(θ_transformed) # sums up to 1
ll(θ_constrained) = logpdf_with_trans(distr_trans2, θ_constrained, true)

ForwardDiff.gradient(ll, θ_constrained) #also only first 2 elements nonzero, but matches dimension in transformed space

i) Is that about right? What I do not truely get is why I have to (or am I wrong?) add a 0 to the K-1 sampled parameter in order to transfer back via the SimplexBijector{Val{false}} bijector such that the constrained parameter sum up to 1. I guess the advantage of handling the ‘0’ internally does not outweigh the additional difficulties that you have when you need to add the likelihood (wrt to constrained k-dimension parameter) to the transformed (prior) distribution?

ii) In the code above, only the first two elements of the gradient are non-zero, but this time it matches the dimension of the transformed parameter (excluding the 0 that I need to add before taking the inverse), and the logpdf_transformed is now dependent on all parameter. I would assume that Turing would use this transform instead of the default one, and then use only Gradient information of the first K-1 parameter? If so, is there a reason why transform(Dirichlet(..)) assigns by default SimplexBijector{Val{true}}()? I assumed that the other option would be prefered in this case?

Best regards

  1. SimplexBijector{Val{true}}() #default transformer - a softmax transform which is non-bijective. …

Everything in this point is correct, but it’s not a softmax transform (this is what I tried to correct from my previous comment; sorry about that!). It’s the same stick-breaking transformation as SimplexBijector{Val{false}} but setting the last component to zero.

And you’re correct that the abs-det-log-jacobian term doesn’t make any sense in this particular case since for the last component we’d have a zero along the diagonal which we’d subsequently attempt to take the log off. But if you consider the transformation as only working on a (n - 1) subspace, we end up dropping the last component and hence the jacobian is invertible (and (n - 1) by (n - 1) rather than n by n). This way everything works out.

  1. SimplexBijector{Val{false}}() …

You’ve understood this correctly, but

logpdf(distr_trans2, inv_bij2([1.,2,0]) ) == logpdf(distr_trans2, inv_bij2([1.,2,300]) ) #FALSE

should not happen. It does happen, but due to numerical issues rather than an issue with the transform itself

b = Bijectors.SimplexBijector(false)
ib = inv(b)

ib([1., 2, 0]), ib([1., 2, 300])         # => ([0.576117, 0.373355, 0.0505281], [0.576117, 0.373355, 0.0])
sum.((ib([1., 2, 0]), ib([1., 2, 300]))) # => (1.0, 0.949471894068249)

i) Is that about right? What I do not truely get is why I have to (or am I wrong?) add a 0 to the K-1 sampled parameter in order to transfer back via the SimplexBijector{Val{false}} bijector such that the constrained parameter sum up to 1. I guess the advantage of handling the ‘0’ internally does not outweigh the additional difficulties that you have when you need to add the likelihood (wrt to constrained k-dimension parameter) to the transformed (prior) distribution?

I’m not entirely certain what you mean by “does not outweigh the additional difficulties that you have when you need to add the likelihood (wrt to constrained k-dimension parameter) to the transformed (prior) distribution?”. Hopefully this is answered by my explanation above. Just to restate: the inverses of both transformations are identical when restricted to the image of the transformation b (i.e. (n - 1)-dim real space), but outside of this proj = true is NOT a bijection but will give you something that lies on the unit-simplex while proj = false is still a bijection but whose output does not necessarily lie on the unit-simplex.

ii) In the code above, only the first two elements of the gradient are non-zero, but this time it matches the dimension of the transformed parameter (excluding the 0 that I need to add before taking the inverse), and the logpdf_transformed is now dependent on all parameter. I would assume that Turing would use this transform instead of the default one, and then use only Gradient information of the first K-1 parameter? If so, is there a reason why transform(Dirichlet(..)) assigns by default SimplexBijector{Val{true}}()? I assumed that the other option would be prefered in this case?

You’re correct that we use only the first K - 1 parameters, but I’m not sure why you’d then prefer the proj = false. Then you’d be doing extra work only to throw it away afterwards. It indeed can cause issues with the jacobian being singular/non-invertible, but as explained above, this is handled appropriately internally.

Hopefully this helps! I do agree it’s all quite confusing :confused:

Just to be sure I understand… SimplexBijector is presented as a map \mathbb{R}^n \rightarrow \mathbb{R}^n. It’s not bijective, but the restriction to \Delta^{n-1} bijects onto its image. Is that right?

Yes, that’s a good way of putting it :+1: