I’m wondering if people have implemented this before somewhere, like in Turing? But I’m not very familiar with Turing and how it integrates with AD, so maybe someone could point me to a reference:

Given a parameterized distribution $p(x \mid \theta)$ and a loss

$$\Gamma(\theta) = \sum_x p(x \mid \theta)\, f(x),$$

usually, since the space of $x$ is too large to sum over directly, we estimate the expectation by sampling $x_1, \dots, x_N$ from $p(x \mid \theta)$:

$$\Gamma(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i).$$

One should not use AD directly on $\Gamma$ here (the sampling step is not differentiable), so we need to compute its gradient ourselves, which can be derived from the summation form via the log-derivative trick:

$$\nabla_\theta \Gamma(\theta) = \sum_x f(x)\, \nabla_\theta p(x \mid \theta) = \sum_x p(x \mid \theta)\, f(x)\, \nabla_\theta \log p(x \mid \theta),$$

which means we can estimate the gradient from the same samples we just drew:

$$\nabla_\theta \Gamma(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, \nabla_\theta \log p(x_i \mid \theta).$$

So now the problem is: how can this be implemented efficiently?
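To make the estimator concrete, here is a minimal sketch for a Bernoulli($\theta$) example. All the names (`score_grad`, `f`, `dlogp`) are hypothetical, just for illustration; in practice $\nabla_\theta \log p$ is where AD (e.g. Zygote on `logp`) would come in, rather than a hand-written derivative:

```julia
# log p(x | θ) for a Bernoulli distribution, x ∈ {0, 1}.
logp(x, θ) = x * log(θ) + (1 - x) * log(1 - θ)

# ∇_θ log p(x | θ), written out by hand here; this is the piece
# one would normally get by applying AD to logp w.r.t. θ only.
dlogp(x, θ) = x / θ - (1 - x) / (1 - θ)

# Monte Carlo estimate of ∇_θ Γ(θ) ≈ (1/N) Σᵢ f(xᵢ) ∇_θ log p(xᵢ | θ).
function score_grad(f, θ, N)
    g = 0.0
    for _ in 1:N
        x = rand() < θ ? 1 : 0      # sample x ~ Bernoulli(θ)
        g += f(x) * dlogp(x, θ)
    end
    return g / N
end

# Sanity check: for f(x) = x we have Γ(θ) = θ, so ∇_θ Γ(θ) = 1,
# and the estimate should land close to 1 for large N.
g = score_grad(x -> float(x), 0.3, 100_000)
```

Note that `f` is only evaluated in the forward loop; that is exactly the forward value the backward formula reuses, which is what makes the pooling question below relevant.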

I’ve tried some manual implementations for my research before, but then I thought this is something general, and there might already be something out there that I just don’t know about.

But from my own experience, this requires a pool of samples and a function trace for each sample, since the forward value of each sample is needed during the backward pass. To reduce extra allocation, this might need to be integrated with the samplers, or an in-place sampler is required, which inserts each sample into the pool as it is drawn.
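One possible shape for such a pool, as a rough sketch under my own assumptions (`SamplePool`, `sample!`, and `backward` are hypothetical names, not an existing API):

```julia
# A pool that records, for each sample, the forward value f(xᵢ)
# that the backward pass will need again.
mutable struct SamplePool{T}
    xs::Vector{T}         # the samples themselves
    fwd::Vector{Float64}  # cached forward values f(xᵢ)
end
SamplePool{T}() where {T} = SamplePool{T}(T[], Float64[])

# In-place sampling: draw x, evaluate f once, and push both into the
# pool so the backward pass can reuse the forward value.
function sample!(pool::SamplePool, f, draw)
    x = draw()
    push!(pool.xs, x)
    push!(pool.fwd, f(x))
    return x
end

# Backward pass: combine the cached f(xᵢ) with ∇_θ log p(xᵢ | θ)
# to form (1/N) Σᵢ f(xᵢ) ∇_θ log p(xᵢ | θ).
function backward(pool::SamplePool, dlogp, θ)
    N = length(pool.xs)
    return sum(pool.fwd[i] * dlogp(pool.xs[i], θ) for i in 1:N) / N
end

# Tiny deterministic demo: four samples that are always 1.
pool = SamplePool{Int}()
for _ in 1:4
    sample!(pool, x -> float(x), () -> 1)
end
grad = backward(pool, (x, θ) -> x / θ - (1 - x) / (1 - θ), 0.5)
```

The design question is whether `sample!` should live inside the sampler itself (so every draw is recorded automatically) or be a wrapper like the above; the latter is simpler but forces the caller to thread the pool through.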

Since I don’t know whether this already exists, we keep having to re-implement similar ad-hoc things around samplers again and again…

If this kind of operator doesn’t exist yet, I’m happy to make a PR to Zygote or wherever it belongs, but it’s not quite clear to me how this should be integrated with the samplers.