[ANN] MinimallyDisruptiveCurves.jl

Hi all,

I thought I would crosspost my package announcement here.

  • The package implements a recent algorithm for regional sensitivity analysis of models with many parameters. It works with any model, as long as there is a differentiable cost function.

  • The post on the package announcement board is here

  • The user guide, which contains motivation, examples, and Jupyter notebooks, is here

  • It relies heavily on the amazing DifferentialEquations.jl ecosystem, so thanks to all who contributed to that! Basically, it evolves an ODE on the parameters of a model in such a way that the cost function remains as low as possible (a conceptual sketch of the idea follows this list).

  • I’m targeting people who have mechanistic models (from e.g. systems biology, computational neuroscience), which they’ve already fitted, and want to analyse further.

  • Feedback (positive or negative!) would be really welcome. It’s a work in progress. Thanks!
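
For a flavour of the idea, here is a hand-rolled conceptual sketch. This is NOT the package API or the exact algorithm; the toy cost function and solver choices below are mine, purely for illustration.

# Conceptual sketch only -- not the MinimallyDisruptiveCurves.jl API or the exact algorithm.
# The idea: starting from fitted parameters, integrate an ODE whose right-hand side is the
# locally least cost-sensitive direction, so the cost stays (approximately) minimal along the curve.
using ForwardDiff, OrdinaryDiffEq, LinearAlgebra

# Toy cost with a non-identifiable product: any θ with θ[1]*θ[2] == 1 fits perfectly
cost(θ) = (θ[1] * θ[2] - 1)^2

function least_sensitive_direction(θ, ref, t)
    H = ForwardDiff.hessian(cost, θ)
    v = eigen(Symmetric(H)).vectors[:, 1]   # eigenvector of the smallest eigenvalue
    return dot(v, ref) < 0 ? -v : v         # keep a consistent orientation along the curve
end

θ0    = [1.0, 1.0]                          # a "fitted" parameter set
prob  = ODEProblem(least_sensitive_direction, θ0, (0.0, 2.0), [1.0, -1.0])
curve = solve(prob, Tsit5())
@show cost(θ0) cost(curve[end])             # both ≈ 0: the curve stays on the set of good fits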


In the main package announcement thread here, I have also provided code and a comparison of the method vs. the global sensitivity analysis techniques used on the Lotka-Volterra example in Global Sensitivity Analysis · DifferentialEquations.jl

Hi and thanks for this cool package! I had a stab at implementing the parameter-covariance estimation that is the topic of chapter 6 in your thesis. My trivial prototype with a linear Gaussian model y = Ax is below. Would you say that I have understood the method properly? In particular, am I implementing equation 6.7 correctly, i.e. \nabla_D \theta(D) = (\nabla_\theta^2 L)^{-1} \nabla_D \nabla_\theta L(\theta, D)?

Thanks :)

using ForwardDiff, LinearAlgebra, Test
using ForwardDiff: jacobian, gradient, hessian

# Generate some problem data y = Ax
n  = 1000
m  = 4
x  = randn(m)
A  = randn(n,m)
y  = A*x
σₑ = 1
e  = σₑ*randn(n)
yn = y + e # Add noise to measurements

# estimate parameters
function est(A, yn)
    xh = A\yn
end

# calculate sum of squared errors
function errors(A, yn, xh = est(A,yn))
    sum(abs2, yn - A*xh)/2
end

# This should hold for a linear Gaussian model
xh = est(A,yn)
H  = hessian(xh->errors(A,yn,xh), xh)
@test σₑ^2*inv(H) ≈ σₑ^2*inv(A'A)

# Implements eq. 6.8
function parameter_covariance(sumsquares, yi, xh)
    H = hessian(xh->sumsquares(yi,xh), xh) # ∇²θ
    J = jacobian(yi) do y # ∇D
        gradient(xh->sumsquares(y,xh), xh) # ∇θ, evaluated at the estimate xh
    end # ∇D∇θ
    ∇ = Symmetric(H)\J # eq. 6.7  (∇²θ)⁻¹∇D∇θ
    @show σ² = 2/(length(yi) - 1)*sumsquares(yi,xh)
    G = σ²*∇*∇' # eq. 6.8
end

sumsquares(y,x) = errors(A,y,x)

C = parameter_covariance(sumsquares, yn, xh) # This should now match σₑ^2*inv(A'A)
@test C ≈ σₑ^2*inv(H) atol = 1e-4

Hi, thanks for your interest! Gosh it’s nice to know somebody is reading P164 of my thesis…that’s not something I would have thought of as a high likelihood event.

Disclaimer for anybody else reading: this is NOT related to MinimallyDisruptiveCurves.jl (which riffs off chapters 4/5 of my thesis)

I looked at your code. The implementation of 6.7 seems to be correct upon inspection. I would note that I wrote the thesis in the dark days before I knew of automatic differentiation. Nowadays I would just write a program mapping the data to a parameter estimate, and differentiate it to get the LHS of equation 6.7 without the hassle. The outcome might have better numerical conditioning as well, since inverse Hessians are ugly…

In terms of understanding the method… if you have any particular questions I’d be happy to answer them.
For an overall elevator speech of the intuition…
Background

  1. When modelling, we often estimate the covariance of a parameter estimate using the Cramér-Rao lower bound, which tells us to invert the Fisher Information Matrix, and which is a perfect estimate in the limit of increasing data (made concrete for your linear example just after this list).
  2. This depends upon a hidden assumption of Frequentist statistics: discrepancy between model predictions (at the optimal parameter estimate) and data is purely due to measurement noise. The ‘perfect model’ assumption.
  3. The goal is to make a different covariance approximation that doesn't rely on this assumption, but that asymptotically agrees with Cramér-Rao in the limit of increasing data if the perfect model assumption holds (and also asymptotically agrees with the true covariance, of course!).
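
To make 1. concrete with your linear Gaussian example: there the Fisher information is I(\theta) = A^\top A / \sigma_e^2, so the Cramér-Rao covariance is

I(\theta)^{-1} = \sigma_e^2 (A^\top A)^{-1}

which is exactly the σₑ^2*inv(A'A) your first @test compares against.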

How

  1. Estimate the distribution of the data, independent of a model.
  2. What is the covariance of the model’s parameter estimate, as a function of the empirical data distribution?
  3. Use the implicit function theorem (equation 6.7, but you could use automatic differentiation instead) to find the gradient of the parameter estimate with respect to perturbations in the data
  4. Use the delta method (of statistics, not neural networks) to turn this into an expression for the covariance we wanted in 2 (spelled out just after this list).
  5. Do some maths to show this estimate has the properties we asked for in the background section.
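
To spell out the delta-method step in 4 (just the headline formula; the precise conditions are in the thesis): if \Sigma_D is the covariance of the data and \nabla = \nabla_D \hat{\theta}(D) is the gradient from step 3, then

\mathrm{Cov}(\hat{\theta}) \approx \nabla \Sigma_D \nabla^\top

which collapses to \sigma^2 \nabla \nabla^\top (equation 6.8, the G = σ²*∇*∇' line in your code) when \Sigma_D = \sigma^2 I.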

As a side note, I would like to use automatic differentiation to implement this method nicely in Julia. In particular, I think you could generalise it past the restricted case of ODEs and Gaussian measurement noise that I considered in the thesis. But it's on my low-priority list due to time. If anybody is interested in collaborating, hit me up!


Thanks for your reply!

For completeness, I guess this direct application of AD to find \nabla_D \theta is what you mean?

# Implements eq. 6.8 using AD directly
function parameter_covariance2(est, sumsquares, yi)
    ∇  = jacobian(est, yi)                    # ∇_D θ: differentiate the estimator w.r.t. the data
    xh = est(yi)                              # parameter estimate at the observed data
    σ² = 2/(length(yi) - 1)*sumsquares(yi,xh) # noise variance estimate
    G = σ²*∇*∇' # eq. 6.8
end

and it does indeed seem to yield the same results in this primitive example:

julia> @btime C = parameter_covariance(sumsquares, yn, xh)
  9.608 ms (1002 allocations: 45.60 MiB)
4×4 Matrix{Float64}:
  0.00104353   1.56565e-5  -1.38611e-5  -4.13944e-6
  1.56565e-5   0.00106845  -1.43595e-5  -3.17278e-5
 -1.38611e-5  -1.43595e-5   0.00097011  -1.35513e-5
 -4.13944e-6  -3.17278e-5  -1.35513e-5   0.00103502

julia> @btime C2 = parameter_covariance2(y->est(A,y), sumsquares, yn)
  8.691 ms (1529 allocations: 44.84 MiB)
4×4 Matrix{Float64}:
  0.00104353   1.56565e-5  -1.38611e-5  -4.13944e-6
  1.56565e-5   0.00106845  -1.43595e-5  -3.17278e-5
 -1.38611e-5  -1.43595e-5   0.00097011  -1.35513e-5
 -4.13944e-6  -3.17278e-5  -1.35513e-5   0.00103502

I think this would be very valuable, and I would be keen to help out with this. The implementation of eq. 6.8 using AD is essentially realized above and requires only a function from data to parameter estimate. In practice, such a function might be harder to apply AD to than it was above, since not all iterative solvers etc. are amenable to AD without intervention. The route through eq. 6.7 is probably more likely to work with AD out of the box. What's required to generalize this to non-Gaussian measurement noise is currently beyond me, but I would sure be interested in finding out.
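
To make that point concrete, here is a hedged sketch (reusing A, yn, sumsquares and parameter_covariance from above; the solver, step size and names are purely illustrative): the estimate can come from an iterative solver that we never differentiate through, and the eq. 6.7 route only needs derivatives of the loss at the returned solution.

using LinearAlgebra

# A deliberately AD-unfriendly estimator: plain gradient descent with a
# convergence check (branching that some AD tools dislike). We treat it as a
# black box and never differentiate through it.
function est_iterative(A, yn; iters = 10_000, η = 1e-4)
    xh = zeros(size(A, 2))
    for _ in 1:iters
        g = A' * (A * xh - yn)     # gradient of the least-squares loss at xh
        norm(g) < 1e-10 && break
        xh -= η * g
    end
    xh
end

xh_it = est_iterative(A, yn)                         # black-box solve
C_it  = parameter_covariance(sumsquares, yn, xh_it)  # eq. 6.7 only needs the loss around xh_it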

Not directly related, but you may find the following discussion interesting Reverse differentiation through nlsolve · Issue #205 · JuliaNLSolvers/NLsolve.jl · GitHub

Thanks, yes the direct application of AD to find \nabla_D \theta I was thinking about is as you posted.

Interesting to see the linked thread about reverse diff through an optimisation routine. I vaguely remember skimming some control literature a few months ago that did AD through optimisation routines that synthesise optimal controllers (LQG?). So I perhaps naively thought it was tractable without too much thought. I guess not!

In that case, the method in equation 6.7 might indeed be a good way of doing things out of the box, as you say.

Thanks for your interest in helping out! Maybe we can talk offline about how to set this up? I guess the coding burden for equation 6.7 isn't too heavy, but maybe some thought is required as to how to structure a package to be as easy to use and modular as possible.

Generalising to other measurement noises: I wrote some very scratchy notes as I was furiously finishing up my PhD. I haven't got the time to make complete sense of them right now (assuming they make sense). But I think it was something to do with using the unscented transform to directly estimate moments of the induced probability distribution of the parameter estimate, given the empirical data distribution.
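
Just to sketch the flavour of that idea, without claiming it matches those notes: below is the textbook (scaled) unscented transform applied to the toy linear example above. It pushes sigma points of an assumed data distribution through the estimator and reads the parameter covariance off the mapped points. The function name and tuning constants are illustrative only.

using LinearAlgebra

# Rough illustration only: propagate sigma points of the data distribution
# (mean ȳ, covariance Σy) through the estimator `est`, then take the weighted
# covariance of the resulting parameter estimates.
function unscented_parameter_cov(est, ȳ, Σy; α = 1.0, κ = 0.0, β = 2.0)
    n = length(ȳ)
    λ = α^2 * (n + κ) - n
    S = cholesky(Symmetric((n + λ) * Σy)).L                       # matrix square root
    Y = vcat([ȳ], [ȳ + S[:, i] for i in 1:n], [ȳ - S[:, i] for i in 1:n])
    Wm = [λ / (n + λ); fill(1 / (2(n + λ)), 2n)]                  # mean weights
    Wc = copy(Wm); Wc[1] += 1 - α^2 + β                           # covariance weights
    Θ  = map(est, Y)                                              # parameter estimate at each sigma point
    θ̄  = sum(Wm .* Θ)
    sum(Wc[i] * (Θ[i] - θ̄) * (Θ[i] - θ̄)' for i in eachindex(Θ))
end

# With Gaussian noise this reproduces the linear example above, e.g.
# C_ut = unscented_parameter_cov(y -> est(A, y), yn, σₑ^2 * Matrix(I, n, n))
# but the data covariance Σy could equally come from a non-Gaussian fit.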