(Related to the thread "Large memory consumption when using Mooncake via DifferentiationInterface for Gaussian process optimisation".)
I’m starting this thread to discuss good practices for using Mooncake.jl via DifferentiationInterface.jl together with Optim.jl when implementing Gaussian processes (GPs). I use GPs in my research, but I often run into numerical issues. In particular, I work with the GPLVM model and often need to optimise thousands of parameters. Some of the issues I face may be unrelated to Mooncake.jl or DifferentiationInterface.jl and may instead stem from implementation details, e.g. type instability or other oversights. My goal is to document these problems and their resolutions in a way that others can reuse, by making them available in an online resource (perhaps here). To do this, I will share a sequence of code examples, where each iteration isolates and fixes a specific issue.
Although the Julia ecosystem offers many relevant packages, I want to keep the examples “self-contained” with as few dependencies as possible, so that the underlying issues are easy to identify.
Below I start with what I consider straightforward code, the kind a casual Julia user (including myself) might write. I begin with standard GP regression with one-dimensional inputs and outputs. I plan to address issues systematically, either within this thread or by opening a separate thread per issue (I’d appreciate your guidance on which approach is preferable).
The first issue I want to address is a numerical problem that I have run into multiple times. Though the code below may be inefficient in several respects, the focus here is on isolating and, hopefully, resolving this one specific issue.
Issue 1: The code below implements standard GP regression for one-dimensional inputs and outputs using the RBF kernel. The function fitgp_1 implements two versions of the objective function to be maximised. The first, marginal_loglikelihood_MvNormal, uses Distributions.MvNormal to calculate the log-likelihood. The second, marginal_loglikelihood_explicit, calculates the log-likelihood explicitly via a Cholesky factorisation. Both versions return the same log-likelihood value when given the same hyperparameters. However, when optimising with automatic gradients I get wildly different results: in particular, marginal_loglikelihood_explicit appears to suffer from numerical issues. It is not obvious why there is such a discrepancy between the two versions.
Why does the second version behave so differently?
using Distributions
using DifferentiationInterface
using LinearAlgebra
using Optim
using Random
import Mooncake
function toydata()
    # generate some synthetic data, not important how exactly,
    # just to have something to test the code on
    rng = MersenneTwister(1234)
    x = rand(rng, 1000)*10
    y = sin.(x) + 0.1*randn(rng, length(x))
    return y, x
end
function fitgp_1(t, x; iterations = 10, JITTER = 1e-6)
    # t are the target outputs (here one-dimensional)
    # x are the inputs (here one-dimensional)
    # Initialise hyperparameters
    logℓ, logα, logσ = 0.0, 0.0, 0.0
    rbf(x, y, logℓ, logα) = exp(-abs2(x - y)*exp(logℓ))*exp(logα)
    function calculatecovariance!(K, logℓ, logα)
        # Can be done better, but the point here is to understand where
        # problems may arise with allocations and performance
        # when writing own code
        local N = length(x)
        for i in 1:N
            for j in 1:N
                K[i, j] = rbf(x[i], x[j], logℓ, logα)
            end
        end
    end
    function unpack_parameters(params)
        local logℓ, logα, logσ = params
        return logℓ, logα, logσ
    end
    function marginal_loglikelihood_MvNormal(logℓ, logα, logσ)
        # calculate the covariance matrix for the passed hyperparameters
        local K = zeros(length(x), length(x))
        calculatecovariance!(K, logℓ, logα)
        # add noise variance to the diagonal and jitter for numerical stability
        K += exp(logσ) * I + JITTER * I
        # return log marginal likelihood of a Gaussian process for Gaussian likelihood
        return logpdf(MvNormal(zeros(length(x)), K), t)
    end
    function marginal_loglikelihood_explicit(logℓ, logα, logσ)
        # calculate the covariance matrix for the passed hyperparameters
        local K = zeros(length(x), length(x))
        calculatecovariance!(K, logℓ, logα)
        # add noise variance to the diagonal and jitter for numerical stability
        K += exp(logσ) * I + JITTER * I
        # return log marginal likelihood of a Gaussian process for Gaussian likelihood
        local L = cholesky(Symmetric(K)).L
        return -0.5 * sum(abs2, L \ t) - sum(log, diag(L)) - 0.5*length(x)*log(2π)
    end
    # define negative log-likelihood for optimization
    nll_MvNormal(params) = -marginal_loglikelihood_MvNormal(unpack_parameters(params)...)
    nll_explicit(params) = -marginal_loglikelihood_explicit(unpack_parameters(params)...)
    # Sanity check:
    # show that the two implementations of the marginal log-likelihood
    # differ only very slightly for the same hyperparameters.
    let
        initparam = randn(3)
        @show nll_MvNormal(initparam)
        @show nll_explicit(initparam)
    end
    # Optimise hyperparameters using LBFGS
    opt = Optim.Options(iterations = iterations, show_trace = true, show_every = 1)
    # optimise using the MvNormal implementation
    optimize(nll_MvNormal, [logℓ, logα, logσ], LBFGS(), opt, autodiff=Mooncake.AutoMooncake())
    # optimise using the explicit implementation
    optimize(nll_explicit, [logℓ, logα, logσ], LBFGS(), opt, autodiff=Mooncake.AutoMooncake())
end
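As a standalone sanity check of the identity behind marginal_loglikelihood_explicit, the following stdlib-only sketch (toy values, no GP code involved) confirms that the Cholesky-based expression matches a direct evaluation of the zero-mean Gaussian log-density:

```julia
using LinearAlgebra, Random

rng = MersenneTwister(0)
N = 5
A = randn(rng, N, N)
K = A*A' + 1e-3*I          # a small SPD "covariance" matrix
t = randn(rng, N)

# direct evaluation of the zero-mean Gaussian log-density at t
direct = -0.5*dot(t, K \ t) - 0.5*logdet(K) - 0.5*N*log(2π)

# Cholesky-based evaluation, as in marginal_loglikelihood_explicit:
# with K = L*L', we have t'K⁻¹t = ‖L⁻¹t‖² and log det K = 2 Σᵢ log Lᵢᵢ
L = cholesky(Symmetric(K)).L
chol = -0.5*sum(abs2, L \ t) - sum(log, diag(L)) - 0.5*N*log(2π)

@show direct chol          # the two agree to floating-point accuracy
```

So the formula itself is correct; the discrepancy above only shows up once automatic differentiation enters the picture.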
Provided that packages are locally available, you can copy-paste the above code and run it with:
y,x = toydata()
fitgp_1(y,x)
What you should see when running the above code is that the second optimisation run that uses marginal_loglikelihood_explicit runs into numerical issues and breaks prematurely.
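When an AD gradient is suspect, one quick diagnostic that needs no extra packages is a central finite-difference gradient. The helper below (fdgrad is a hypothetical name, not part of the code above) could be evaluated at the same point as the gradients DifferentiationInterface computes for nll_MvNormal and nll_explicit; agreement for one version but not the other would point at that version's pullback rather than at the objective itself:

```julia
using LinearAlgebra

# central finite differences; `f` maps a vector to a scalar
function fdgrad(f, x; h = 1e-6)
    g = similar(x)
    for i in eachindex(x)
        xp = copy(x); xp[i] += h
        xm = copy(x); xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2h)
    end
    return g
end

# tiny self-contained check: f(x) = 0.5 x'Ax has gradient A*x for symmetric A
A = [2.0 0.5; 0.5 1.0]
f(x) = 0.5*dot(x, A*x)
x0 = [1.0, -2.0]
@show fdgrad(f, x0)        # ≈ A*x0 = [1.0, -1.5]
```

For the GP objectives above, one would call e.g. fdgrad(nll_explicit, initparam) inside fitgp_1 and compare against the Mooncake gradient at the same initparam.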
My local setup is Julia 1.12.4 with the following packages:
(MooncakeGPExperiments) pkg> st
Status `~/tmp/MooncakeGPExperiments/Project.toml`
[a0c0ee7d] DifferentiationInterface v0.7.16
[31c24e10] Distributions v0.25.123
[da2b9cff] Mooncake v0.5.12
[429524aa] Optim v2.0.1
[37e2e46d] LinearAlgebra v1.12.0
[9a3f8284] Random v1.11.0