Is autodifferentiation possible in this situation?

Dear all,
I am here to ask the help of the community. I am a Physicist, working in the domain of Astrophysics/Cosmology. About a month ago I discovered Julia and I started translating my Python code in this new language: the new code is about 10x faster than the Python optimized code. After this results, I decided I will use Julia for my next research projects.

After this brief introduction, my questions regard the usage of autodifferentiation, since in my project I need to evaluate several numerical derivatives.

Detailed explanation

In my work I have to evaluate coefficients such as

C_{i j}^{A B}(\ell)=\frac{c}{H_{0}} \int_{z_{\min }}^{z_{\max }} \mathrm{d} z \frac{W_{i}^{A}(z) W_{j}^{B}(z)}{E(z) r^{2}(z)} P_{\delta \delta}\left(\frac{\ell+1 / 2}{r(z)}, z\right)

In this expression there are several terms:

  • E(z) , for which I have an analytic expression
  • r(z) , which is evaluated as the 1D integral of E(z) (which, for instance, I evaluate using QuadGK.jl)
  • W_{i}^{A}(z) , which is composed by analytic functions and/or E(z) and r(z)
  • P_{\delta \delta} , which is obtained as the numerical solution of a system of a differential equations ( if one is interested in high precision results) or using some analytical approximations

Altough not explicited stated, all these functions depends on some parameters (the cosmological parameters I am interested in). In order to evaluate the Fisher matrix (the final goal of my project), I need the derivatives of these coefficients C_{i j}^{A B}(\ell) with respect to the aforementioned parameters. Since the integrand is the (numerical) solution of a ODE system and the integration is performed numerically (with Newton-Cotes or Romberg techniques), finite differences techniques are not stable and I employ a technique (described here ) based on linear regression. I am quite satisfied with the numerical stability and precision of this algorithm, but this requires me to evaluate the previous coefficients for 14 values of each parameter. I want to stress that nor the integration interval or the integration variable depends upon the parameters wrt I want to evaluate the derivatives. I started reading something about autodiff and so I asked myself if I could evaluate the derivatives I need using this technique.

  1. Is it possible to use autodiff with functions which are obtained from numerical integrals?
  2. Is it possible to use autodiff with functions which are obtained from numerical integrals where the integrand is the numerical solution of a ODE system, or the integrand must be an analytic function? If this is possible, the ODE solver must be written in Julia or could be also a package written in another language (e.g. C)?
  3. If some of this options are possible, which packages should I consider to implement in my project? Zygote.jl and/or something else?

Furthermore, since my knowledge of autodiff is quite limited, if someone could give me the link ot some good resources, this would be really appreciated.

Thank you for your time:)

Cheers,
Marco Bonici

3 Likes

The TLDR is

  1. yes
  2. yes
  3. Zygote should just work.

For a really good high level overview, I’d recommend JuliaCon 2016 | ForwardDiff.jl: Fast Derivatives Made Easy | Jarrett Revels - YouTube. It’s a few years old, but holds up pretty well.

2 Likes

To expand a little:

  1. If you’re looking to use autodiff through ODE solves then this is supported in DifferentialEquations.jl, which is compatible with the autodiff provided by Flux.
  2. It’s worth noting that (very broadly speaking) there are two ways of autodiff’ing through ODE solves. The first way is to differentiate through the internal operations of the solver. This does require the ODE solver to be written in a way that the autodiff understands. (This is known as “discretise-then-optimise”.)

However the other way is to first differentiate the continuous-time ODE. This in fact gives another ODE, running backwards-in-time. Then numerically solve this backwards-in-time ODE to compute the gradients. (This is known as “optimise-then-discretise”.)

Generally speaking the first way is preferred (it’s usually faster and gives more accurate gradients). The reason I mention this is that the latter way does actually make it possible to run autodiff on ODE solvers written in some other package or some other language: set up the backwards-in-time ODE, then solve it using your external package. If you’re coming from Python then an explicit example is provided by torchdiffeq, which is capable of differentiating through the SciPy solvers this way.

  1. Indeed Zygote/Flux is probably the way to go.

You may like this as a resource for getting started learning autodiff.

5 Likes

There’s a whole extensive system for this, complete with each of the options you wanted and the options you didn’t know you wanted. The short story is, if you stick an ODE solve into ForwardDiff, ReverseDiff, Tracker, or Zygote, or ModelingToolkit, or AutoGrad.jl (or Nabla? Never treid that one) calls it’ll work. Hell, even TensorFlow will differentiate the DifferentialEquations.jl ODE solvers.

If you want the longer story:

https://diffeq.sciml.ai/stable/analysis/sensitivity/
https://diffeqflux.sciml.ai/dev/ControllingAdjoints/

Specifically you can then finish this with Quadrature.jl for representing the integral and taking it (and the Quadrature.jl wrapper is differentiable), or see the part on continuous adjoints of functionals.

If you’re looking for the Fisher information matrix, then maybe you’ll want to use second order adjoint stuff, and forward-over-reverse is generally recommended here. Local Sensitivity Analysis (Automatic Differentiation) · DifferentialEquations.jl. Also I would just try ForwardDiff.hessian on everything if you have small numbers of parameters (<100?).

You can in theory get around this constraint by using Enzyme

You could then Clang compile the C code and we could autodiff that. I haven’t setup a default for that, but I plan to next week. If you look at the Enzyme paper, I made sure some of the examples were PDE semidiscretizations for this reason. I can’t do it until next week though, unless you can ever convince @vchuravy to work on DiffEq stuff :wink: . You’ll be way out in the weeds though, so I’d highly recommend doing it with pure Julia code (or R code)

There’s a funny thing that happens in this case, at least for a lot of astro cases. For long time integrations you may want to use a symplectic integrator. They are used a lot throughout our classical physics models examples:

https://tutorials.sciml.ai/html/models/01-classical_physics.html

To get a clearer picture of what it does you can check out the Kepler model examples

https://tutorials.sciml.ai/html/models/05-kepler_problem.html

Now what does that have to do with autodiff? The reason optimize-then-discretize and discrete-then-optimize are different is because running the ODE method itself backwards (in autodiff) is different from its forwards. Running through the steps of Explicit Euler backwards is Implicit Euler, since the point at which f is approximated is the now going forwards and the future going backwards. But there are methods which are self-adjoint, the Trapezoidal method uses half of f from now and f from the future, which is the same going backwards. So methods with this self-adjointness property have the same derivative in both optimize-then-discretize and discretize-then-optimize if the steps are matched.

It turns out that all symplectic integrators are (a) fixed time step (except for some really weird cases which are fun to think about but haven’t really found a use yet) and (b) self-adjoint, which means astrophysics and other symplectic problems are the case where this distinction comes together.

So I don’t know if this story comes into play here, but it’s a fun one to share haha.

5 Likes

So whilst this is a complete tangent from this topic, I agree that this is super cool. It actually turned up recently in the ML literature here. (With adaptive step sizes IIRC.)

Hey, good to see another Julian cosmologist! I bet forward-mode AD will give you a 10x improvement (over 14 evaluations per parameter, yikes) without any effort.

I’ve been thinking a bit about computing angular statistics too, for CMB stuff (without the Limber approximation, so lots of Bessel in Bolt). I know people in LSS have also tried harmonic space approaches to the integral for these observables, i.e. FFTLog approaches. But definitely check out Quadrature.jl as Chris says above.

3 Likes

I don’t have much to add to the excellent responses so far but figured you might be interested to see an application of this in a familiar setting (cosmology). This is indeed what we do for weak gravitational lensing of the CMB with the LenseFlow algorithm. Eq. 24 of this paper is the ODE, and Eq. 41 is the gradient ODE running backwards in time. Its very easy to teach Zygote to take a gradient by running the gradient ODE, that’s done here, with the forward and backward ODE’s corresponding to those two equations defined here and here.

3 Likes

In the limit where the C code is only a small cost compared to the overall computation, you might be able to build a linear approximation by computing the Jacobian using finite differencing, and then define that as a custom adjoint. Another strategy would be to build a differentiable surrogate model with something like Surrogates.jl.

1 Like

Not necessarily, many astrophysics applications tend to use a lot of symplectic integrators for other reasons so there’s like a 50% chance it’s a tangent, 50% chance it’s what will show up in the actual computation.

2 Likes

Ah, sorry, I didn’t mean that to sound so combative.

No worries. I just mean to say, I don’t know the exact ODE here, but whenever Hairer brings up geometric methods they are for astrophysics applications, so :man_shrugging: we got a good chance of hitting this interesting land here? It’s either that or one of the second order multistep methods that NASA uses and nobody else uses (probably the weirdest phenomena in the field).

1 Like

This is a fun question – I believe the ODE here is the linear evolution of perturbations in the matter power spectrum. After some regime in the early universe where these equations are stiff, these are always integrated with RK, even though large wavenumber modes may encounter many oscillations between entering the horizon and the present. Do you need to conserve energy for these integrations?

I think you’re basically saved by the existence of matter which isn’t dark matter. Everything from regular baryonic matter to neutrinos to photons tends to damp the small scale oscillations, so when you look at something like the Baryon Acoustic Oscillations, you are mostly seeing the very longest wavelength modes, which have only gone through a few oscillation periods since the Big Bang. C_{\ell} is the angular projection of the perturbations in k, and so is dominated by the low k where it’s unnecessary to use a symplectic integrator.

1 Like

Thank you all for you long and detailed answers. Unfortunately, I am not yet able to properly understand all the things you have been talking about: I need to carefully study the papers/videos you have linked.

I’ll dive a bit in the details since both @xzackli and @marius311 will probably follow me without any effort, being them cosmologists;)

As @ChrisRackauckas correctly stated, I am interested in the Fisher Information Matrix ( and hence ForwardDiff.hessian would probably be of interest to me). In particular I am working in LSS (more specifically, I am a member of the ESA Euclid mission) and I have already a working, efficient and well tested code which evaluates the FIM. However there is always room for improvement and I’d like to use autodiff to improve the efficiency of my code (and maybe use this in future works).

Here I will outline the steps necessary to evaluate the C_\ell's in my code.

  1. The background quantities (H(z) and r(z)) are evaluated over the z grid
  2. The nonlinear Power-Spectrum P_{\delta\delta} is evaluated over the k-z grid
  3. P_{\delta\delta} is interpolated and evaluated over the Limber grid (in the future this may change, using FFT-log techniques as @xzackli anticipated)
  4. The weight functions W_i^a(z) are assembled using the precomputed objects
  5. The integrals are computed using Newton-Cotes or Romberg integration schemes

For sure, one of the most sensitive issue is the evaluation of the Matter Power spectrum: actually I am using the Halofit recipe and CLASS as Boltzmann solver, but in the future I could be using an emulator as well (like the one described here ). Another crucial step is the numerical integration: while the Quadrature.jl suggested by @ChrisRackauckas is differentiable, I do not know if is possible to make the Newton-Cotes or Romberg integrations schemes differentiable.

However, the usage of surrogate (and differentiable!) models suggested by @xzackli intrigues me. If I well understand, @marius311 did something similar with PICO, using \sim10^4 CAMB simulations. So, in principle it could be possible evaluate the C_\ell's with my pipeline, give them to Surrogate.jl and use the surrogate model. Probably this is worth testing, at least in a 1D parameter space just to see if this is feasible.

About the ODE integrator: as @xzackli reports, I am talking of the linear evolution of matter perturbations. CLASS uses the ndf15 integration scheme (more details on this can be found here or here ).
I do not know if there are more efficient methods for this specific problem implemented in DiffEq.jl, but in this moment I believe I would not try to write down a new Boltzmann solver (altough if I understand both @marius311 and @xzackli tried something similar).

They are just weighted sums, so they should be “differentiable by default”, i.e. it should already work unless the code is written in a funny way. So while I haven’t tested Richardson.jl on the Newton-Cotes formula, it should work. That said, is there a reason why you’d use this over Gauss points? Non-uniform points would converge much faster.

there’s a ton of similar methods. ODE Solvers · DifferentialEquations.jl. CVODE_BDF, RadauIIA5, radau radau5, dde_bdf, TRBDF2, KenCarp4, etc. There is a QNDF but I wouldn’t recommend NDF formulae in most cases (their stability region is so small!). There’s good reasons why the idea never caught on beyond Shampine’s one library with it (otherwise Sundials would incorporate NDF instead of BDF, since it’s just a coefficient change).

You can find a few benchmarks against MATLAB on classic stiff equations in ODE Solver Multi-Language Wrapper Package Work-Precision Benchmarks (MATLAB, SciPy, Julia, deSolve (R)) . The results are pretty much where you’d expect them to be (up to 7 stiff equations, so if you have more you have to adjust for the BLAS scaling)

4 Likes

Getting accurate linear matter power spectra is definitely one of my top todos for Bolt.jl! But it’ll take me + collaborators some time, and I’m a little worried about some of the details of the neutrino hierarchy that i.e. Lesgourgues has worked hard on. A PICO-like approach sounds pretty effective for getting science done immediately.

1 Like

Maybe the point is that I am not an expert and I do not know exactly what differentiable mean in this context (as I said before, I need to study :sweat_smile:).

Yes, there a reason. The coefficients I evaluate, the C_\ell's, can be represented as a 20\times20 symmetric matrix (so we have 210 distinc elements) for each multipole considered. Usually I evaluate from 100 to 3000 multipoles, so the number of distinct integrals ranges from 210\times100= 2.1\times 10^4 to 210\times3000= 6.3\times 10^5. If I use a fixed grid (altough with more points, this is true) I can evaluate all the needed quantities once for all and then, with the correct combination, I obtain the required result in a much faster way. I remark that the evaluation of P_{\delta\delta} is quite time consuming. Other guys I know use a different approach: they interpolate the Matter Power Spectrum evaluated on the grid and then using Gaussian Quadrature. For a comparison, my Julia code evaluates the whole C_\ell matrix for 3000 multipoles in 0.8 seconds, while, with the latter approach, other codes requires several seconds when the multipoles are about 100 (I did not consider the P_{\delta\delta} computation time in this benchmark since its evaluation does not differ in the two schemes). So, in my particular case using a fixed grid and Newton-Cotes algorithm looks much faster. Furthermore, it is possible that the P_{\delta\delta} is given to me as an external input (its evaluation is not an easy task) on a fixed grid.

I had already seen this comparison and I was really impressed but, as @xzackli says, obtaining accurate power spectra is a really complicated tasks: the Boltzmann codes used by the community have been developed and improved in several years. So, while I think that using Julia it would be possible to reach their performance or even do something better, in this particular moment is a too big task for me…I hope that @xzackli will obtain interesting results in this direction:)

1 Like

@ChrisRackauckas Thank you for your amazing diffeq packages, I’ve been using them to solve these kinds of equations recently and it’s been so nice to work with.

Just to provide some additional context – these equations are very stiff at early times, but this has been known since the 80s, and very robust approximations have been developed in this domain for what is called the “tight coupling” regime. ndf15 and similar efforts are very cool, and I’ve had success solving these equations with the similar radau and rosenbrock methods in DifferentialEquations.jl, but getting the evolution within the usual tolerances can be accomplished with the usual approximations + any RK method like Tsit5.

1 Like

What about the stiffness detecting methods? AutoTsit5(Rosenbrock23()) or the sort may be interesting here?

1 Like

Yeah, sure: it is possible to use RK but, as stated in CLASS papers, the computation time is much longer.

I’ll definitely try that out!

I believe it’s only about a factor of 2 between RK and ndf15 if one uses tight coupling, and it seems to depend on the approximation scheme made for the relativistic neutrinos. I admit I don’t completely understand how their method outperform RK, I’m guessing it’s because we turn off these tight coupling approximations when the evolution becomes only moderately stiff, and then you want a scheme that can deal with that transition. But as Chris says, there are very similar integrators in DifferentialEquations.jl.

1 Like