Backward pass for sampling

I’m wondering if people have implemented this before somewhere, like in Turing? But I’m not very familiar with Turing or how it integrates with ADs, so maybe someone could point me to a reference:

Given a parameterized distribution p(x | \theta), and a loss

\Gamma(\theta) = \sum_x f(x) * p(x|\theta)

Usually, since the space of x is too large to calculate the summation directly, we estimate the expectation by sampling x from p:

\Gamma(\theta) = E_{x \sim p(x|\theta)} f(x)

One cannot use AD directly on this sampled estimate of \Gamma, so we need to calculate its gradient by hand, which can be derived from the summation form:

\begin{aligned} \frac{\partial}{\partial \theta}\Gamma(\theta) &= \frac{\partial}{\partial \theta} \sum_x f(x) \, p(x|\theta) \\ &= \sum_x f(x) \frac{\partial}{\partial \theta} p(x|\theta) \\ &= E_{x\sim p(x|\theta)} \left[ \frac{f(x)}{p(x|\theta)} \frac{\partial}{\partial \theta} p(x|\theta) \right] \end{aligned}

which means we can calculate the gradient from the same samples we just drew. So now the problem is: how can this be implemented efficiently?
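To make the estimator concrete, here is a minimal Python sketch (names like `score_gradient` are mine, not from any existing library) for a Bernoulli(θ) distribution, where \Gamma(\theta) = f(0)(1-\theta) + f(1)\theta and so the exact gradient is f(1) - f(0):

```python
import random

def f(x):
    # arbitrary per-sample loss; a stand-in for the user's f(x)
    return 3.0 if x == 1 else 1.0

def score_gradient(theta, n_samples, rng):
    """Monte Carlo estimate of dGamma/dtheta via the score-function
    (likelihood-ratio) trick for Bernoulli(theta):
    E_{x ~ p}[ f(x) * (1/p(x|theta)) * d/dtheta p(x|theta) ]."""
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < theta else 0
        # (1/p) * dp/dtheta = d/dtheta log p(x|theta) for Bernoulli:
        #   x=1: d/dtheta log(theta)     =  1/theta
        #   x=0: d/dtheta log(1 - theta) = -1/(1 - theta)
        score = 1.0 / theta if x == 1 else -1.0 / (1.0 - theta)
        total += f(x) * score
    return total / n_samples

rng = random.Random(0)
theta = 0.3
est = score_gradient(theta, 200_000, rng)
exact = 3.0 - 1.0  # f(1) - f(0) = 2 for the Bernoulli case
print(est, exact)
```

The estimator only needs samples from p and the per-sample score, which is what makes it attractive when the sum over x is intractable.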

I’ve tried some manual implementations for my research before, but then I thought this is something general; there might be something already out there that I just don’t know about.

From my own experience, this requires a pool of samples and a function trace for each sample, since the forward value of each sample is needed during the backward pass. To reduce extra allocation, this might be integrated with the samplers, or an in-place sampler is required that inserts each new sample into the pool.
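One way to avoid keeping a pool alive is to fold the gradient accumulation into the sampling loop itself. A hypothetical sketch in Python (`ScoreGradAccumulator` is an invented name, not an existing API):

```python
import random

class ScoreGradAccumulator:
    """Streaming accumulator: each sample's f(x) and score
    (d/dtheta log p(x|theta)) are pushed as the sample is drawn,
    so no sample pool or AD trace has to be retained."""
    def __init__(self):
        self.n = 0
        self.value_sum = 0.0
        self.grad_sum = 0.0

    def push(self, fx, score):
        self.n += 1
        self.value_sum += fx
        self.grad_sum += fx * score

    def estimates(self):
        # returns (estimate of Gamma, estimate of dGamma/dtheta)
        return self.value_sum / self.n, self.grad_sum / self.n

# usage with the Bernoulli example: p(x=1) = theta
rng = random.Random(2)
theta = 0.4
acc = ScoreGradAccumulator()
for _ in range(100_000):
    x = 1 if rng.random() < theta else 0
    fx = 3.0 if x == 1 else 1.0
    score = 1.0 / theta if x == 1 else -1.0 / (1.0 - theta)
    acc.push(fx, score)

value, grad = acc.estimates()
print(value, grad)  # exact values: Gamma = 1.8, dGamma/dtheta = 2
```

This keeps memory constant in the number of samples, at the cost of fixing the gradient target (θ) before sampling starts.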

Since I don’t know whether this already exists, we keep having to re-implement similar things as temporary workarounds with the samplers again and again…

If this kind of operator doesn’t exist yet, I’m happy to make a PR to Zygote or wherever it belongs, but it’s not quite clear to me how it should be integrated with the samplers.


Your example doesn’t quite make sense: do you mean

\Gamma(\theta) = E_{X \sim p(\cdot | \theta)}[f(X)] = \int_\Omega f(x) p(x|\theta) dx

but you want to compute

\frac{d}{d\theta} \Gamma(\theta) = \int_\Omega f(x) \frac{d}{d\theta}p(x|\theta) dx = E_{X \sim p(\cdot|\theta)} \left[ f(X) \frac{d}{d\theta} \log p(X|\theta) \right]


\frac{d}{d\theta}p(x|\theta) = p(x|\theta) \frac{\frac{d}{d\theta}p(x|\theta)}{p(x|\theta)} = p(x|\theta) \frac{d}{d\theta} \log p(x|\theta)


If so, this reduces to computing the gradient of the logpdf at the sampled values.
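As a concrete illustration of "the gradient of the logpdf at the sampled values": for X \sim N(\theta, 1), \frac{d}{d\theta} \log p(x|\theta) = x - \theta, so the estimator is a plain sample average. A minimal sketch (the function name and the choice f(x) = x^2 are mine, for illustration only):

```python
import random

def f(x):
    return x * x  # E[f(X)] = theta^2 + 1 for X ~ N(theta, 1)

def grad_estimate(theta, n, rng):
    """Estimate d/dtheta E[f(X)] for X ~ Normal(theta, 1) using only
    the gradient of the logpdf at the sampled values:
    d/dtheta log p(x|theta) = (x - theta)."""
    acc = 0.0
    for _ in range(n):
        x = rng.gauss(theta, 1.0)
        acc += f(x) * (x - theta)
    return acc / n

rng = random.Random(1)
theta = 0.5
est = grad_estimate(theta, 400_000, rng)
print(est)  # exact answer is d/dtheta (theta^2 + 1) = 2*theta = 1.0
```

In a Julia setting the `(x - theta)` term would come from AD applied to the logpdf, rather than being written out by hand as here.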


Yes, that’s what I mean.

I think computing the gradient of the logpdf at the sampled values is how we were doing this, but I haven’t seen any existing sampler/AD operator that lets you do it in a few lines.

Each time I have to write my own sampler and manage the memory for the gradient myself (accumulating the gradient during sampling, so that I don’t need to store the whole trace in the AD).

So I’m just wondering: has someone already implemented something similar somewhere, or does it not exist yet?
