Hello people! Perhaps this question should be on math.stackexchange, and sorry for the lousy formulation, but I feel this subject hits right at the heart of what I consider the Julia community to be about. I’ve recently stumbled upon this paper
which I find somewhat difficult to follow. I’ve been trying to differentiate the probabilities p_{ij}(t) = \mathbb{P}(X_t=j\;\vert\;X_0 = i), where X_t is a CTRW on a graph starting at i. Say at each node k the jumping-out distribution is \tau(\beta_k), parametrized by a parameter \beta_k. It would be great if, for each stochastic forward pass, there were a backward pass to compute the gradient \nabla_\beta p_{ij}(t) (which is a square matrix) for fixed i and fixed t, or at least to carry the stochastic duals \beta_k+\epsilon_k that the paper describes for each visited node k (that wouldn’t be as flashy, but it would still be extremely interesting).

This reminds me of some things I’ve seen in reinforcement learning. Is there existing work on it? Although the jumping rates might be continuous random variables, I feel that overall this setting is of a discrete nature, so there is no reparametrization trick. Any tip to shed light on this? Where should I look? And, most importantly: can it currently be done with Stochastic AD?
Hi @aner-sanchez. First, some questions: by CTRW, do you mean continuous-time random walk? By jumping-out distribution \tau(\beta_k), do you mean the probability distribution in time of a jump to any next node (a.k.a. the sojourn time or waiting time)?
While the setting is of a discrete nature, the probability distributions for jumps can still be continuous if there is a transition density for each jump from i to j, p_x(i,j), that is continuous in the differentiation parameter x. That gives the Markov sequence a density that is the distribution of initial conditions multiplied by the density of each step. Because of this, you can use score functions and a reparametrization trick.
That said, I don’t know of a package in Julia for this, aside from StochasticAD. That package does contain good tutorials and example code for calculating a score function, if that’s what you’re looking for. And I fully sympathize with how hard the paper is to read, even though you clearly have a background in this area.
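To make the score-function idea concrete, here is a minimal hand-rolled sketch. It is not code from the paper or from StochasticAD; the softmax jump probabilities, the three candidate nodes, and the indicator target are all made up purely for illustration:

```julia
# Minimal score-function (REINFORCE) sketch for a single parametrized jump.
# Toy setup: from node 1 we jump to node j with probability softmax(x)[j], and we
# estimate ∇_x E[f(J)] via the identity ∇_x E[f(J)] = E[f(J) ∇_x log p_x(J)].
using Random, Statistics

softmax(x) = exp.(x) ./ sum(exp.(x))

function score_gradient_sample(x, f; rng = Random.default_rng())
    p = softmax(x)
    j = findfirst(cumsum(p) .>= rand(rng))   # sample the destination node
    ∇logp = -p                               # ∇_x log softmax(x)[j] = e_j - p
    ∇logp[j] += 1
    return f(j) .* ∇logp                     # one unbiased sample of the gradient
end

x = [0.3, -0.2, 0.5]                         # made-up parameters for 3 candidate nodes
f(j) = float(j == 3)                         # e.g. the indicator of landing on node 3
grads = [score_gradient_sample(x, f) for _ in 1:100_000]
@show mean(grads)                            # ≈ ∇_x softmax(x)[3], i.e. the gradient of that jump probability
```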
The method in the paper is not about differentiating the probabilities per se; it’s about differentiating an expected value.
More precisely, if you have a random variable X(p) that depends on parameter(s) p, such that its expected value \mathbb{E}[X(p)] is a differentiable function of p, the goal is to construct another random variable X'(p) (the “stochastic gradient”) such that \mathbb{E}[X'(p)] = \frac{d}{dp} \mathbb{E}[X(p)]. (This is what you need for stochastic gradient descent (SGD).)
For example, if you have a function X(p) = rand() < p ? sin(p) : cos(p) on p \in [0,1], its expected value is p\sin(p) + (1-p)\cos(p), so the goal of stochastic AD is to construct another random function (ideally one with small variance) whose expected value is the derivative \sin(p) + p\cos(p) - \cos(p) - (1-p)\sin(p). Of course, in this simple example we should just ditch random variables and evaluate \mathbb{E}[X(p)] analytically, but the point is to handle cases where that is not possible.
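Concretely, here is a quick Monte Carlo sanity check of this toy example (plain sampling, nothing package-specific):

```julia
# Plain Monte Carlo check of the toy example: the sample mean of X(p) approaches
# p*sin(p) + (1-p)*cos(p); stochastic AD then asks for a random function whose
# sample mean approaches the *derivative* of that expression.
using Statistics

X(p) = rand() < p ? sin(p) : cos(p)

p = 0.3
@show mean(X(p) for _ in 1:1_000_000)               # Monte Carlo estimate of E[X(p)]
@show p*sin(p) + (1 - p)*cos(p)                     # exact E[X(p)]
@show sin(p) + p*cos(p) - cos(p) - (1 - p)*sin(p)   # the derivative we want to estimate
```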
The first author (Gaurav Arya @gaurav-arya) gave an introductory lecture on the basic underpinnings for our 2023 Matrix Calculus course at MIT, and his lecture is on YouTube along with posted lecture notes. (His lecture mostly focused on the continuous case … he got to the discrete case at the end and focused mainly on explaining the problem; he didn’t have time to explain the solution in detail.)
See also their Julia package StochasticAD.jl.
Yes, handling situations where the usual reparameterization trick doesn’t work is the whole point of this paper, as I understand it.
(I’m not an author on this paper, but I know the authors. I don’t know if Gaurav has an account on Julia Discourse, but @ChrisRackauckas does.)
Note that the method described in the paper is mostly a form of forward-mode AD (directly analogous to ForwardDiff.jl and dual numbers), but they do describe a preliminary way to extend it to reverse mode too.
And this is a case where StochasticAD would have issues, because it can handle rand(Bernoulli(p)) but not rand() < p; that’s a detail, though. Yes, this is exactly the kind of problem it’s made to solve.
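For reference, a rough sketch of what that looks like for the toy example above, with the branch rewritten so the Bernoulli outcome flows through arithmetic rather than rand() < p. I believe derivative_estimate is the entry point, but check the StochasticAD.jl docs for the exact API:

```julia
# Sketch of the same toy example with StochasticAD.jl; the Bernoulli outcome is kept
# in arithmetic rather than an if/branch on rand() < p.  (Hedge: verify the API
# against the package docs; derivative_estimate is used here as in its tutorials.)
using StochasticAD, Distributions, Statistics

function X(p)
    b = rand(Bernoulli(p))                    # discrete randomness StochasticAD can propagate
    return b * sin(p) + (1 - b) * cos(p)      # same distribution as the ternary version
end

p = 0.3
samples = [derivative_estimate(X, p) for _ in 1:100_000]
@show mean(samples)                                 # unbiased estimate of d/dp E[X(p)]
@show sin(p) + p*cos(p) - cos(p) - (1 - p)*sin(p)   # the exact derivative, for comparison
```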
I was trying to differentiate the probabilities because, ultimately, I was trying to differentiate
\mathbb{E}(v[X_t])
that is, the value of a function at the random final node, so the derivatives of the probabilities with respect to the parameters appear there. Also, I’d say you can rewrite the probability itself as an expected value: p_{ij}(t) = \mathbb{E}(\delta_{X_t,j} \;\vert\; X_0 = i).
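For concreteness, the kind of forward pass I have in mind looks like the sketch below: uniform jumps to neighbors and exponential sojourn times with rate \beta_k at node k, where the graph, the rates, and v are all made up just to illustrate:

```julia
# Toy forward pass (everything made up to illustrate): a continuous-time random walk
# on a small graph where node k has an exponential sojourn time with rate β[k] and
# jumps go to a uniformly chosen neighbor.  E[v[X_t]] and p_{ij}(t) = E[δ_{X_t,j}]
# are then plain Monte Carlo averages over such forward passes.
using Random, Statistics

neighbors = [[2, 3], [1, 4], [1, 4], [2, 3]]      # adjacency lists of a made-up 4-node graph

function sample_X_t(β, t; i0 = 1, rng = Random.default_rng())
    x, s = i0, 0.0
    while true
        s += randexp(rng) / β[x]                  # time of the next jump out of node x
        s > t && return x                         # the clock passed t while still at x
        x = rand(rng, neighbors[x])               # otherwise jump to a random neighbor
    end
end

β = [1.0, 2.0, 0.5, 1.5]                          # made-up per-node rates
v = [0.0, 1.0, 2.0, 3.0]                          # made-up node values
t = 2.0
nodes = [sample_X_t(β, t) for _ in 1:100_000]
@show mean(v[x] for x in nodes)                   # Monte Carlo estimate of E[v[X_t]]
@show mean(nodes .== 4)                           # estimate of p_{1,4}(t) via the indicator
```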
I’m reading a lot about adjoint methods, and I feel there should be some sort of adjoint random walk to a random walk! Maybe for a continuous-time Markov chain it is just transposing the infinitesimal generator, but I think going more general would be pretty interesting.
I guess it’s the same forward-backward game here, where you want to avoid starting a random walk at every point. I don’t know much yet, though; I’m just learning, and I will read the resources provided above!
Thanks for the answer, and sorry for the bad notation; yes, the terminology you mention is what I should have written.
I was thinking of p_x(i,j) as independent of the parameters x; if it were dependent, it would sort of be graph learning. I don’t fully follow your sentence about the Markov sequence. I was also thinking of non-Markovian random walks, so as-weird-as-they-can-get sojourn times, and at the moment I don’t see how you would compute \nabla_x \log(p_{x}(i,j)) if you have no clue what p_{x}(i,j) is. Maybe I just don’t know enough, hahaha; I will read the tutorials!
People use different terminology all the time. I only asked about it in order to make it easier for people to help, and they are trying to help! But if you feel new to things, then little messages like these may be a distraction. There is a clear review paper called “Monte Carlo Gradient Estimation in Machine Learning” by Mohamed et al. (2020). It doesn’t explicitly cover CTRWs, but I like these reviews for getting a foothold. And if you learn something about using StochasticAD, come back and share. - Drew
JumpProcesses.jl is the main package in Julia for continuous-time random walks. It’s not quite compatible with StochasticAD.jl yet; there are some efforts going on to fix that. Gaurav does have a Gillespie SSA implementation that is StochasticAD.jl-compatible, but I don’t know where to find it right now. Probably just ping him on Slack (I’m doing grant stuff this week, so a bit disconnected).
Note that if you have an analytical expression for the expected value in terms of the parameterized probabilities, then you can apply ordinary AD to that analytical expression directly.
The point of stochastic AD is to handle cases where it is impractical to evaluate the expected value deterministically, so you also cannot evaluate the derivative deterministically. Instead, you only have a stochastic function that is an unbiased estimator (averaging its samples converges to the desired expected value), and stochastic AD produces another stochastic function that is an unbiased estimator for the derivative. Of course, internally this will involve some form of derivatives of the probability distributions, but as a user you won’t generally be concerned with the precise details of this. The input that you give to stochastic AD is your unbiased estimator, not just some probability function.
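To make the contrast concrete with the earlier toy example (just a sketch, nothing specific to any package):

```julia
# First route (closed-form expectation available): ordinary AD applies directly.
using ForwardDiff

Ef(p) = p*sin(p) + (1 - p)*cos(p)          # analytic E[X(p)] for the toy example above
@show ForwardDiff.derivative(Ef, 0.3)      # derivative of the expectation, computed deterministically

# Second route (no closed form): hand stochastic AD the sampler X(p) itself and
# average many unbiased derivative samples (see the derivative_estimate sketch
# earlier in the thread); in expectation the two routes agree.
```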
At the slight risk of de-focusing the thread here:
I’m working a lot with Gillespie-type stochastic processes, coupled to likelihood-free inference with ABC (Approximate Bayesian Computation).
I imagine that stochastic derivatives of the distance function (between model and data statistics) with respect to model parameters would help a lot to explore the parameter space more efficiently. I don’t know, maybe like a Hamiltonian Monte Carlo within ABC, or another form of gradient-based ABC if possible.
Is this something worth exploring? Are you already working on this, or would it be useful to go into this topic a bit? I would be very happy to (roughly) understand the current state of this! The general idea sounds very exciting.
I had to go into ABC-SMC algorithms myself this year, as we wanted to keep track of the “particle weights” as a model-evidence estimate for model selection (a.k.a. the normalising constant or marginal likelihood). It seems to work quite well for the problems I have encountered so far (ABCdeZ.jl).
From this point of view it would be really fantastic to have gradient-based ABC-SMC that can still keep track of the correct (perhaps readjusted) particle weights, so as not to lose the model evidence estimate. It feels like it should be possible, but I don’t know for sure. Happy to join forces or divide the work if anything would help.
You can do deterministic differentiation of matrix exponentiation with a variety of algorithms; e.g., ChainRules.jl implements the algorithm of Al-Mohy and Higham (2009). (And, in particular, if you are eventually computing a scalar-valued function, e.g. an ML “loss” function, with the matrix exponential as an intermediate step, then the chain rule can be carried efficiently through the matrix exponential.)
Part of the confusion in the math.stackexchange question you linked is how to think about what the derivative is, and in particular the proposed answer \frac{\partial \exp(DA)}{\partial (DA)} = DA \exp(DA) is a category error: it’s not even the right kind of object for a matrix derivative. What you need is a Fréchet derivative, a linear operator on matrices, which is not simply a matrix of the same size. See e.g. our matrix calculus course, or this short talk on the nature of the problem.
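For example, one standard way to apply the Fréchet derivative of \exp at X to a particular direction E is the 2\times 2 block-triangular identity; here is a small sketch checked against a finite difference (the matrices are made up, and nothing here is specific to ChainRules.jl):

```julia
# Sketch: the Fréchet derivative of exp at X, applied to a direction E, can be read
# off the upper-right block of exp([X E; 0 X]).  Checked against a central finite
# difference; the matrices here are a made-up toy example.
using LinearAlgebra

n = 4
X = randn(n, n)
E = randn(n, n)                            # the perturbation direction

block = exp([X E; zeros(n, n) X])          # 2n×2n block-triangular trick
dexp_E = block[1:n, n+1:2n]                # = L(X, E), the Fréchet derivative applied to E

h = 1e-6
fd = (exp(X + h*E) - exp(X - h*E)) / (2h)  # finite-difference approximation in direction E
@show norm(dexp_E - fd) / norm(fd)         # should be tiny
```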
With A\in\mathbb{R}^{n\times n}, v\in\mathbb{R}^n, and D=\text{diag}(d), the map goes f\colon \mathbb{R}^n \longrightarrow \mathbb{R}^n, so you would expect the differential Df at d_0\in\mathbb{R}^n to be a linear transformation Df\colon \mathbb{R}^n \longrightarrow \mathbb{R}^n, which is what DA\exp(DA)\,\text{diag}(Av) is (taking D=\text{diag}(d_0)), although that was just a guess based on vibes.
EDIT: thank you for answering, because I was also interested in the case where you don’t multiply by v. I’m learning a lot!