Backward pass for sampling

Roger-luo · April 12, 2019, 10:39am

I’m wondering if people have implement this before somewhere, like Turing? but I’m not very familiar with Turing and how it is integrated with ADs, so maybe someone could point me to reference:

Given a parameterized distribution p(x | \theta), and a loss

\Gamma(\theta) = \sum_x f(x) * p(x|\theta)

usually since the space of x is too large, instead of calculate the summation directly, we calculate the expectation by sampling x from p

\Gamma(\theta) = E_{x \sim p(x|\theta)} f(x)

One should not use AD directly for \Gamma here, thus we need to calculate its gradient, which can be derived from summation

\begin{aligned} \Gamma(\theta) &= \frac{\partial}{\partial \theta} \sum_x f(x) * p(x|\theta)\\ &= \sum_x f(x) \frac{\partial}{\partial \theta} p(x|\theta) &= E_{x\sim p(x|\theta)} \frac{f(x)}{p(x|\theta)} \frac{\partial}{\partial \theta} p(x|\theta) \end{aligned}

which means we can calculate the gradient from those samples we just got as well, so now the problem
is how to implement this efficiently?

I’ve tried some manually implementation for my research before, but then I thought this is something general, there might be something already lying there, but I just don’t know.

But from my own experience, this requires a pool for samples and function trace for each sample, since the forward value for each sample is required for backward use. But to reduce extra allocation, this might be integrate with samplers, or an in-place sampler is required, which insert the sample to the pool each time.

Since I don’t whether this thing exists, we have to implement similar things for temporary with samplers again and again…

If this kind of operator doesn’t exist yet, I’m happy to make a PR to Zygote or somewhere it should belong to, but I guess it’s not quite clear how this should be integrate with samplers to me

simonbyrne · April 12, 2019, 4:25pm

Your example doesn’t quite make sense: Do you mean

\Gamma(\theta) = E_{X \sim p(\cdot | \theta)}[f(X)] = \int_\Omega f(x) p(x|\theta) dx

but you want to compute

\frac{d}{d\theta} \Gamma(\theta) = \int_\Omega f(x) \frac{d}{d\theta}p(x|\theta) dx = E_{X \sim p(\cdot|\theta)} \left[ f(x) \frac{d}{d\theta} \log p(x|\theta) \right]

since

\frac{d}{d\theta}p(x|\theta) = p(x|\theta) \frac{\frac{d}{d\theta}p(x|\theta)}{p(x|\theta)} = p(x|\theta) \frac{d}{d\theta} \log p(x|\theta)

?

If so, this reduces to computing the gradient of the logpdf at the sampled values.

Roger-luo · April 14, 2019, 5:13pm

Yes, that’s what I mean.

I think computing the gradient of logpdf at the sampled values is the way we were doing this, but I haven’t seen any existing samplers/AD operators that let you do this with a few lines. (like an operator)

I have to write my own sampler and maintain the memory for gradient for myself each time (to accumulate the gradient within sampling, so I don’t need to store all the trace in AD)

So I’m just wondering if someone has already implemented similar thing somewhere, or this is not exist?

Topic		Replies	Views
Use FiniteDiff.jl in Turing.jl Probabilistic programming turing	20	478	May 9, 2024
Recommended way to extract logprob and build gradients in Turing.jl? Probabilistic programming turing	2	740	October 5, 2022
Gradient of latent space variable in Turing? General Usage	1	365	May 12, 2021
Speeding up gradient of logpdf Machine Learning question , performance , autodiff	19	764	February 12, 2024
Noise with gradient works! General Usage	26	587	May 14, 2024

Backward pass for sampling

Related topics