How to sample ordered variables with different distributions in turing

Hi julia bros,

I want to use turing to sample ordered variables each with different distributions. For example,

X~Normal(0,1) Y~Exp(1), Z~Uniform(0,1)

and require

X>Y>Z

I know this can be done by rejection sampling but in case there are lots of ordered variables the efficiency is very low. Is there a way in turing or other packages in julia to help?

Gonzalo

Would try something like

@model function demo()
    Z ~ Uniform(0, 1)
    Y ~ truncated(Exponential(1); lower = Z)
    X ~ truncated(Normal(); lower = Y)
    [X, Y, Z]
end

chain = sample(demo(), NUTS(), 1_000; progress = true)

Not sure if it works correctly though, as it returns some samples violating the constraints … maybe a bug?

chain = sample(demo(), SMC(), 1_000; progress = true);

Seems to return non-violating samples.

Thank you. In your code Z ~ Uniform(0,1) and the others are conditioned on its value but that’s not accurate because Z’s distribution is not affected. A rejection sampling can show that X,Y,Z’s distributions are not what they are specified at the beginning.

Well, yes, that was my interpretation of your definition.
Seems you want a different density though. In this case, you will probably need to define a transformation from \mathbb{R}^3 \to (x, y, z) where x \in [0, 1], y > x, z > y and then use Turing.@addlogprob! to explicitly define the desired density – including the log-Jacobian of the transformation – or define a custom Bijector and use this to implicitly transform the desired density. Have never done this in Turing though.

Following up, does this do what you want?

struct CustomDistribution <: ContinuousMultivariateDistribution
end

Base.length(c::CustomDistribution) = 3

Bijectors.bijector(c::CustomDistribution) = Bijectors.Stacked([Bijectors.Logit(0, 1), identity, identity]) ∘ inverse(Bijectors.OrderedBijector())

Distributions.logpdf(c::CustomDistribution, v::AbstractVector) = sum(logpdf.([Uniform(0, 1), Exponential(1), Normal(0, 1)], v))

@model function demo()
    xyz ~ CustomDistribution()
end

chain = sample(demo(), NUTS(), 1_000; progress = true)

Note: This is just hacked together as MWE. Please check the docs of Distributions and Bijectors in more detail on how to define and use custom distributions!

2 Likes

Is there a bijector which simply wraps the reals mod 1 ? It isn’t 1-1 but it is continuous and a local bijection.
Perhaps I should go and look at the Bijectors code, but here I am asking this question.

(as for the relevance, using such a uniform bijector will generate different sampling distribution, which may somehow correspond better to some desired distribution).

UPDATE: There is a problem though with using such a bijector, as it doesn’t conserve ordering (it is non-monotonic) and wouldn’t work in combination with the second bijector as suggested.

Thank you so much! That appears nicely solve my questions. I am not familiar with Bijectors but what you present gives the same result to rejection sampling.

However if the distributions for X,Y,Z are changed, for example X,Y,Z iid ~ Exponential(1), then I use your code and got mean(X) = 0.28 which however should be 1/3 (can be shown analytically or with rejection sampling).

function rejection_sampling_exp3(n_samples::Int)
samples_X = Float64
samples_Y = Float64
samples_Z = Float64

while length(samples_X) < n_samples
    x = rand(Exponential(1))
    y = rand(Exponential(1))
    z = rand(Exponential(1))
    
    if x < y < z
        push!(samples_X, x)
        push!(samples_Y, y)
        push!(samples_Z, z)
    end
end

return samples_X, samples_Y, samples_Z

end

n_samples = 100000

samples_X, samples_Y, samples_Z = rejection_sampling_exp3(n_samples)

mean_X = mean(samples_X)
mean_Y = mean(samples_Y)
mean_Z = mean(samples_Z)

println(“Mean X: $mean_X”)
println(“Mean Y: $mean_Y”)
println(“Mean Z: $mean_Z”)

I guess this is related to transforming exponentially distributed X~Exp(1) by the Bijectors.

Is there anything that I got wrong? Thanks so much!

This following does rejection sampling, but does it with Turing:

using Turing

@model function demo()
    Z ~ Uniform(0, 1)
    Y ~ Exponential(1)
    X ~ Normal()
    if !(Z < Y < X)
        DynamicPPL.@addlogprob! -Inf
        return
    end
    [X, Y, Z]
end

chain = sample(demo(), NUTS(), 1_000; progress = true)

The average of Z in my test came out to be 0.338

Thanks you much! working pretty well in my exampke. That’s very super clever way.

By the way, do you know how with VI this can be performed?

As you mention yourself, it would not be 1-1, i.e., not a bijector. This will be a problem when trying to compute the Jacobian correction it has on the density – which is needed in your case as you want to define a density on the constrained, i.e., transformed, space.

In this case, you would need to change the bijector to match the support of your distribution:

  • Logit(0, 1): Transforms from a constrained space x \in [0, 1], i.e., the support of uniform, to an unconstrained value y \in \mathbb{R}:
    julia> using Bijectors
    with_logabsdet_jacobian(Bijectors.Logit(0, 1), 0.1)  # (unconstrained value, Jacobian correction)
    
    Turing uses this to define a density p(y) on the unconstrained space which nevertheless has the desired density p(x) when mapped onto the constrained space. By change of variable, this requires to adjust the density by the log-abs-Jacobian as p(y) = p(x) |dx/dy| where dx/dy denotes the Jacobian of the inverse transformation. The Stan manual has a nice explanation on this, but also the source code of Bijectors is rather clear, i.e., just try @less logpdf(transformed(Uniform(0, 1), Bijectors.Logit(0, 1)), 0.2)

Long story short, in case of the Exponential distribution, you need a bijector from \mathbb{R}^+ \to \mathbb{R} instead, i.e., just replace Bijectors.Logit(0, 1)
by bijector(Exponential(1)) – should probably have used bijector(Uniform(0, 1)) in the first place.

Regarding VB: This should work out-of-the-box for the model using CustomDistribution.

This is a bit of a tangent to the thread. But I think even though it is not a bijector, using this transformation is valid, and might be useful. The two caveats needed are:
(i) Continuity of density/transform on the stiched domains of the covering space - without which some algos might get confused (although this may not be a strict requirement)
(ii) A ‘uniform’ covering. For example the mod1 cover of [0,1) with R has the required symmetry (the covering can be partitioned into branches)
With these conditions, the stable distribution on the original space should be maintained (and the sampling will work as intended).

Such more general transformation would be a nice addition (and maybe Stan doesn’t have them).

Still not sure that something like mod 1 is a good idea: First, the posterior on the unconstrained space will be multi-modal as many points map onto the same constrained parameter and accordingly also have the same likelihood. Secondly, and even worse, this means that the posterior will not be proper as it cannot be integrated – without any additional density defined on the unconstrained space directly.
Why do you think such an approach would be better than a 1-1 mapping, e.g., via inverse logit?

I have an intuition it might converge faster.
Speed is not so much of an issue here, but perhaps a different problem can amplify the speed difference.

Would be surprised by that – and in any case, it would only work as most practical samplers are not able to explore the true posterior which is improper and multi-modal.

I don’t understand this comment fully:
Samplers are supposed to sample the distribution which is often multi-modal.
Samplers are not integrating, just sampling. They don’t really know about a global property of the distribution, like whether it is proper.
But I’m not really up to date with all the modern sampling tricks.
This is off-thread topic. So maybe we should let this idea slip through the cracks unless someone picks it up (maybe I will).

This is strictly not correct. MCMC algorithms are numerical integration methods. If certain conditions are satisfied, the resulting samples can be used to approximate certain expectations numerically. Moreover, when those conditions are satisfied, there’s a central limit theorem that bounds the error in the approximation of expectations. MCMC methods cannot sample from an improper distribution. “Improper distributions” are not distributions at all. One could never draw exact samples from an “improper distribution,” so it is also not possible to draw samples using MCMC methods. Multi-modal distributions are in principle no problem for MCMC, but in practice most widely used MCMC algorithms would require infinite time to jump between widely-separated modes. So when possible, it is a good idea to parametrize a model in such a way that it is unimodal.

Few things are.

According to Markov chain Monte Carlo - Wikipedia :
" In statistics, Markov chain Monte Carlo (MCMC ) is a class of algorithms used to draw samples from a probability distribution. " but it is true that:
" MCMC methods are primarily used for calculating numerical approximations ".

You can’t even draw from the Normal Distribution on a computer (everything is finite…).

Also inaccurate, and depends mostly on the width of the chain’s “bottlenecks” (see Mixing Time literature).