Good practices regarding automatic differentiation in a Gaussian process implementation

(Related to Large memory consumption when using Mooncake via DifferentiationInterface for Gaussian process optimisation)

I’m starting this thread to discuss good practices for using Mooncake.jl via DifferentiationInterface.jl together with Optim.jl when implementing Gaussian processes (GPs). I use GPs in my research, but I often run into numerical issues. In particular, I work with the GPLVM model and often need to optimise thousands of parameters. Some of the issues I face may be unrelated to Mooncake.jl or DifferentiationInterface.jl and may instead stem from implementation details, e.g. type instability or other oversights. My goal is to document these problems and their resolutions in a way that others can reuse, by making them available in an online resource (perhaps here). To do this, I will share a sequence of code examples, where each iteration isolates and fixes a specific issue.

Although the Julia ecosystem offers many relevant packages, I want to keep the examples “self-contained” with as few dependencies as possible, so that the underlying issues are easy to identify.

Below I start with what I consider straightforward code, the kind a casual Julia user (including myself) might write. I begin with standard GP regression with one-dimensional inputs and outputs. I plan to address issues systematically, either within this thread or by opening a separate thread per issue (I’d appreciate your guidance on which approach is preferable).

The first issue that I want to address concerns a certain numerical problem that I have run into multiple times. Though the code may be inefficient in many respects, the focus is on isolating and hopefully resolving a very specific issue.


Issue 1: The code below implements standard GP regression for one-dimensional inputs and outputs using the RBF kernel. The function fitgp_1 implements two versions of the objective function to be maximised. The first, marginal_loglikelihood_MvNormal, uses Distributions.MvNormal to calculate the log-likelihood. The second, marginal_loglikelihood_explicit, calculates the log-likelihood explicitly via a Cholesky factorisation. Both versions return the same log-likelihood value for the same hyperparameters. However, when optimising using automatic gradients I get wildly different results: in particular, marginal_loglikelihood_explicit seems to suffer from numerical issues. It is not obvious why there is such a discrepancy between the two versions.

Why does the second version behave so differently?

using Distributions
using DifferentiationInterface
using LinearAlgebra
using Optim
using Random
import Mooncake

function toydata()
    # generate some synthetic data, not important how exactly,
    # just to have something to test the code on
    rng = MersenneTwister(1234)
    x = rand(rng, 1000)*10
    y = sin.(x) + 0.1*randn(rng, length(x))
    return y, x
end

function fitgp_1(t, x; iterations = 10, JITTER = 1e-6)

    # t are the target outputs (here one-dimensional)
    # x are the inputs (here one-dimensional)

    # Initialise hyperparameters
    logℓ, logα, logσ = 0.0, 0.0, 0.0
   
    # RBF kernel; note that exp(logℓ) here acts as an inverse squared lengthscale
    rbf(x, y, logℓ, logα) = exp(-abs2(x - y)*exp(logℓ))*exp(logα)

    function calculatecovariance!(K, logℓ, logα)
        # Can be done better, but the point here is to understand where
        # problems may arise with allocations and performance
        # when writing own code

        local N = length(x)

        for i in 1:N
            for j in 1:N
                K[i, j] = rbf(x[i], x[j], logℓ, logα)
            end
        end

    end

    function unpack_parameters(params)
        local logℓ, logα, logσ = params
        return logℓ, logα, logσ
    end

    function marginal_loglikelihood_MvNormal(logℓ, logα, logσ)

        # calculate the covariance matrix for the passed hyperparameters
        local K = zeros(length(x), length(x))
        calculatecovariance!(K, logℓ, logα)

        # add noise variance to the diagonal and jitter for numerical stability   
        K += exp(logσ) * I + JITTER * I 

        # return log marginal likelihood of a Gaussian process for Gaussian likelihood
        return logpdf(MvNormal(zeros(length(x)), K), t)

    end


    function marginal_loglikelihood_explicit(logℓ, logα, logσ)

        # calculate the covariance matrix for the passed hyperparameters
        local K = zeros(length(x), length(x))
        calculatecovariance!(K, logℓ, logα)

        # add noise variance to the diagonal and jitter for numerical stability   
        K += exp(logσ) * I + JITTER * I 

        # return log marginal likelihood of a Gaussian process for Gaussian likelihood
        local L = cholesky(Symmetric(K)).L
        return -0.5 * sum(abs2, L \ t) - sum(log, diag(L)) - 0.5*length(x)*log(2π)
    end

    # define negative log-likelihood for optimization
    nll_MvNormal(params) = -marginal_loglikelihood_MvNormal(unpack_parameters(params)...)
    nll_explicit(params) = -marginal_loglikelihood_explicit(unpack_parameters(params)...)

    # Sanity check:
    # show that the two implementations of the marginal log-likelihood 
    # differ only very slightly for the same hyperparameters.
    let
        initparam = randn(3)
        @show nll_MvNormal(initparam)
        @show nll_explicit(initparam)
    end

    # Optimise hyperparameters using LBFGS
    opt = Optim.Options(iterations = iterations, show_trace = true, show_every = 1)

    # optimise using the MvNormal implementation
    @info "optimise using the MvNormal implementation"
    @show optimize(nll_MvNormal, [logℓ, logα, logσ], LBFGS(), opt, autodiff=Mooncake.AutoMooncake())

    # optimise using the explicit implementation
    @info "optimise using the explicit implementation"
    @show optimize(nll_explicit, [logℓ, logα, logσ], LBFGS(), opt, autodiff=Mooncake.AutoMooncake())
    
end
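As a side check (not part of the listing above), the Cholesky-based expression used in marginal_loglikelihood_explicit can be verified against the direct Gaussian log-density on a tiny hand-made matrix, using only LinearAlgebra:

```julia
using LinearAlgebra

# Tiny 2x2 example; values are arbitrary, K just needs to be positive definite.
K = [2.0 0.5; 0.5 1.0]
t = [0.3, -0.7]
N = length(t)

# Cholesky-based form, as in marginal_loglikelihood_explicit:
# sum(abs2, L \ t) == tᵀK⁻¹t and sum(log, diag(L)) == 0.5 * logdet(K)
L = cholesky(Symmetric(K)).L
ll_chol = -0.5 * sum(abs2, L \ t) - sum(log, diag(L)) - 0.5 * N * log(2π)

# Direct form: -0.5 tᵀK⁻¹t - 0.5 logdet(K) - 0.5 N log(2π)
ll_direct = -0.5 * dot(t, K \ t) - 0.5 * logdet(K) - 0.5 * N * log(2π)

@assert ll_chol ≈ ll_direct
```

So the two primal computations agree, which matches the sanity check in fitgp_1; the question is only about their gradients.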

Provided the required packages are installed locally, you can copy-paste the above code and run it with:

y,x = toydata()
fitgp_1(y,x)

What you should see when running the above code is that the second optimisation run, which uses marginal_loglikelihood_explicit, runs into numerical issues and breaks prematurely.
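Not a diagnosis of this particular failure, but relevant background on why the JITTER term is in the listing at all: RBF kernel matrices are easily numerically rank-deficient, in which case the Cholesky factorisation fails outright (or its pullback blows up). A minimal stdlib-only illustration:

```julia
using LinearAlgebra

# A rank-1 positive semi-definite matrix: Cholesky fails without regularisation.
K = ones(5, 5)
@assert !issuccess(cholesky(Symmetric(K); check = false))

# Adding a small jitter to the diagonal restores positive definiteness.
JITTER = 1e-6
@assert issuccess(cholesky(Symmetric(K + JITTER * I); check = false))
```

Whether the jitter of 1e-6 is large enough at the parameter values the optimiser visits is a separate question.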

My local setup is Julia Version 1.12.4:

(MooncakeGPExperiments) pkg> st
Status `~/tmp/MooncakeGPExperiments/Project.toml`
  [a0c0ee7d] DifferentiationInterface v0.7.16
  [31c24e10] Distributions v0.25.123
  [da2b9cff] Mooncake v0.5.12
  [429524aa] Optim v2.0.1
  [37e2e46d] LinearAlgebra v1.12.0
  [9a3f8284] Random v1.11.0

Regarding “running into numerical issues and breaking prematurely”, I’m getting the following output:

julia> includet("test-gp.jl")
nll_MvNormal(initparam) = 1256.6934078230022
nll_explicit(initparam) = 1256.6934078230022
[ Info: optimise using the MvNormal implementation
Iter     Function value   Gradient norm 
     0     9.535819e+02     4.870311e+02
 * time: 0.008840084075927734
     1    -7.815952e+02     6.417509e+01
 * time: 1.4313230514526367
     2    -7.871668e+02     2.077147e+01
 * time: 1.724761962890625
     3    -7.906472e+02     4.496643e+01
 * time: 2.114284038543701
     4    -8.066678e+02     4.151909e+01
 * time: 2.6622111797332764
     5    -8.095018e+02     1.087353e+01
 * time: 2.8410961627960205
     6    -8.098022e+02     5.253655e+00
 * time: 3.1320290565490723
     7    -8.099309e+02     7.471592e+00
 * time: 3.526421070098877
     8    -8.108028e+02     4.509602e+00
 * time: 3.854161024093628
     9    -8.111302e+02     4.800723e-01
 * time: 4.141431093215942
    10    -8.111457e+02     2.406922e-01
 * time: 4.5342631340026855
optimize(nll_MvNormal, [logℓ, logα, logσ], LBFGS(), opt, autodiff = Mooncake.AutoMooncake()) =  * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Final objective value:     -8.111457e+02

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 1.12e-01 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.47e-02 ≰ 0.0e+00
    |f(x) - f(x')|         = 1.55e-02 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.91e-05 ≰ 0.0e+00
    |g(x)|                 = 2.41e-01 ≰ 1.0e-08

 * Work counters
    Seconds run:   5  (vs limit Inf)
    Iterations:    10
    f(x) calls:    35
    ∇f(x) calls:   35
    ∇f(x)ᵀv calls: 0

[ Info: optimise using the explicit implementation
Iter     Function value   Gradient norm 
     0     9.535819e+02     4.870311e+02
 * time: 9.608268737792969e-5
     1     7.082533e+02     1.087999e+07
 * time: 0.30930209159851074
     2     7.082540e+02     1.065883e+29
 * time: 6.451901912689209
┌ Warning: Failed to achieve finite new evaluation point, using alpha=0
└ @ LineSearches ~/.julia/packages/LineSearches/lihz0/src/hagerzhang.jl:156
     3     7.082540e+02              NaN
 * time: 11.357081890106201
┌ Warning: Terminated early due to NaN in gradient.
└ @ Optim ~/.julia/packages/Optim/DtV5C/src/multivariate/optimize/optimize.jl:136
optimize(nll_explicit, [logℓ, logα, logσ], LBFGS(), opt, autodiff = Mooncake.AutoMooncake()) =  * Status: failure

 * Candidate solution
    Final objective value:     7.082540e+02

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 0.00e+00 ≤ 0.0e+00
    |x - x'|/|x'|          = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = NaN ≰ 1.0e-08

 * Work counters
    Seconds run:   11  (vs limit Inf)
    Iterations:    3
    f(x) calls:    113
    ∇f(x) calls:   113
    ∇f(x)ᵀv calls: 0

So indeed, the second implementation (nll_explicit) is experiencing numerical issues.


This is exactly what I observe: the same numbers during the optimiser iterations, and the premature break at exactly the same iteration. Thanks for confirming.

I found some parameter values that produce very different gradients.

# Note: this snippet is meant to run inside fitgp_1, where the two
# marginal log-likelihood closures are in scope; it also assumes
# `using Test, Printf, DifferentiationInterface` at the top level.
nll_MvNormal(params::AbstractVector) = -marginal_loglikelihood_MvNormal(params...)
nll_explicit(params::AbstractVector) = -marginal_loglikelihood_explicit(params...)

grad_MvNormal(params::AbstractVector) = gradient(nll_MvNormal, Mooncake.AutoMooncake(), params)
grad_explicit(params::AbstractVector) = gradient(nll_explicit, Mooncake.AutoMooncake(), params)

@testset "nlls equal" begin
    for _ in 1:200
        initparam = 8randn(3)
        @test nll_MvNormal(initparam) ≈ nll_explicit(initparam)
    end
end

@testset "nll gradients equal" begin
    for i in 1:200
        initparam = 8randn(3)
        tbeg = time()
        the_grad_MvNormal = grad_MvNormal(initparam)
        time_MvNormal = time() - tbeg

        tbeg = time()
        the_grad_explicit = grad_explicit(initparam)
        time_explicit = time() - tbeg

        @printf("%d %s\tMvNormal:%.4f explicit:%.4f\n", i, string(initparam), time_MvNormal, time_explicit)
        @test the_grad_MvNormal ≈ the_grad_explicit
    end
end

Output:

julia> fitgp_1(y, x)
Test Summary: | Pass  Total  Time
nlls equal    |  200    200  6.4s
1 [-14.199736251937544, -9.10561555706572, 2.6538254797721703]	MvNormal:1.8051 explicit:1.5531
2 [5.609290053467907, -0.3885215223335294, 1.3378757003073125]	MvNormal:0.5607 explicit:0.5135
3 [-4.4435922197548035, 0.5150626168870941, 0.5421203043287387]	MvNormal:0.5207 explicit:0.3041
4 [5.731650553337068, -6.735325591925515, 0.3741385199880161]	MvNormal:0.4634 explicit:0.5015
5 [2.535604988715315, 3.54025643504518, -7.261611030713925]	MvNormal:0.5241 explicit:0.3096
6 [2.362412579435235, 4.949111093754604, 5.847255214438686]	MvNormal:0.4619 explicit:0.5019
7 [-3.3609508312745775, -7.704552624782568, 0.7951606709014786]	MvNormal:0.2993 explicit:0.4610
8 [3.0501851494084233, 14.735943518357502, -19.306326079864107]	MvNormal:0.5147 explicit:0.3082
nll gradients equal: Test Failed at /Users/forcebru/test/test-gp.jl:81
  Expression: the_grad_MvNormal ≈ the_grad_explicit
   Evaluated: [-977214.8128871938, -25446.0, -18059.25871024232] ≈ [-971367.1885299702, -28378.0, -18059.258710238602]
...
10 [-0.08113466534285438, 15.506494081130453, -8.708303885960373]	MvNormal:0.2992 explicit:0.4706
nll gradients equal: Test Failed at /Users/forcebru/test/test-gp.jl:81
  Expression: the_grad_MvNormal ≈ the_grad_explicit
   Evaluated: [-373.8878286961408, -2.610595703125, -30560.1728372343] ≈ [-373.6003559526871, -5.477783203125, -30560.17283722648]
...
21 [-1.8824827558699302, 12.481122633478888, -5.412806554614432]	MvNormal:0.4609 explicit:0.5116
nll gradients equal: Test Failed at /Users/forcebru/test/test-gp.jl:81
  Expression: the_grad_MvNormal ≈ the_grad_explicit
   Evaluated: [18.50852768868208, 6.390734702348709, -697.4361307374119] ≈ [18.508415449410677, 6.389977499842644, -697.4361307374121]
...
24 [-2.8114261619152034, 8.802134156724248, -9.63474023524947]	MvNormal:0.3042 explicit:0.4650
nll gradients equal: Test Failed at /Users/forcebru/test/test-gp.jl:81
  Expression: the_grad_MvNormal ≈ the_grad_explicit
   Evaluated: [-316.85580825805664, -20.739328384399414, -78753.90638871562] ≈ [-316.88093090057373, -20.64815902709961, -78753.90638871577]
...
123 [-12.490092744128612, 23.9155285615708, -4.605820340345509]	MvNormal:0.3069 explicit:0.4683
nll gradients equal: Test Failed at /Users/forcebru/test/test-gp.jl:81
  Expression: the_grad_MvNormal ≈ the_grad_explicit
   Evaluated: [-184.94972801208496, -857.453125, -11823.697753225504] ≈ [-184.9943962097168, -1208.28125, -11823.697753225508]
...
128 [-3.467596657790478, 9.362793887890204, -14.569677208674502]	MvNormal:0.3088 explicit:0.4747
nll gradients equal: Test Failed at /Users/forcebru/test/test-gp.jl:81
  Expression: the_grad_MvNormal ≈ the_grad_explicit
   Evaluated: [-13023.033203125, -1254.0859375, -1.164290547821325e6] ≈ [-12988.89453125, -815.5, -1.1642905478213246e6]
...
170 [-3.8382965672555174, 21.092804035316234, -6.730961097268818]	MvNormal:0.5143 explicit:0.3081
nll gradients equal: Test Failed at /Users/forcebru/test/test-gp.jl:81
  Expression: the_grad_MvNormal ≈ the_grad_explicit
   Evaluated: [27.40576171875, -14.9638671875, -3988.72208146421] ≈ [9.25537109375, 18.4873046875, -3988.722081464212]
...

In some cases (mostly when the second element of the parameter vector, logα, is large and positive and the third, logσ, is large and negative), the derivative with respect to the second parameter differs a lot between the two implementations and can even change sign.
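When two AD gradients disagree like this, an AD-free central finite-difference check can arbitrate which one (if either) is right at the offending parameter values. The helper below, fdgrad, is my own stdlib-only sketch, not something from the code above:

```julia
# Hypothetical helper: central-difference gradient, for cross-checking AD results.
# h trades off truncation error against floating-point cancellation.
function fdgrad(f, x::AbstractVector{<:AbstractFloat}; h = 1e-6)
    g = similar(x)
    for i in eachindex(x)
        xp = copy(x); xp[i] += h
        xm = copy(x); xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2h)
    end
    return g
end

# Quick self-check on a function with a known gradient (∇ sum(x.^2) = 2x):
g = fdgrad(x -> sum(abs2, x), [1.0, -2.0, 3.0])
@assert isapprox(g, [2.0, -4.0, 6.0]; atol = 1e-6)
```

One would then compare fdgrad(nll_explicit, initparam) and fdgrad(nll_MvNormal, initparam) against the two Mooncake gradients at the failing initparam values reported above, bearing in mind that finite differences themselves become unreliable when the objective is badly conditioned.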
