I am working on a model involving Gaussian processes, which I am optimising with Optim.jl while obtaining gradients via DifferentiationInterface.jl. Unfortunately, when I run the code, the top utility shows that Julia uses a lot of memory.
While investigating, I noticed that even a simple Gaussian process regression model uses a lot of memory when combining Optim.jl with DifferentiationInterface.jl. Below is an MWE that can be copy-pasted and executed (provided the packages are available):
using DifferentiationInterface
using Distributions
using LinearAlgebra
import Mooncake
using Optim
using Random
function gp(x, y; iterations = 1)
    # Get the number of data items
    N = length(y)
    # Allocate the zero mean vector needed by the marginal likelihood once
    zeromean = zeros(N)
    # Initialise parameters randomly
    rng = MersenneTwister(1234)
    initialsol = randn(rng, 3)
    # Pre-allocate the N×N covariance matrix K
    K = zeros(N, N)
    # Set up optimiser options
    opt = Optim.Options(iterations = iterations, show_trace = true, show_every = 1)
    helper(p) = negativemarginallikelihood_gp(p, K, x, y, zeromean)
    # Use DifferentiationInterface to get gradients
    backend = AutoMooncake(config = nothing)
    prep = prepare_gradient(negativemarginallikelihood_gp, backend, initialsol,
                            Cache(K), Constant(x), Constant(y), Constant(zeromean))
    gradhelper!(grad, p) = DifferentiationInterface.gradient!(negativemarginallikelihood_gp, grad,
                                                              prep, backend, p, Cache(K),
                                                              Constant(x), Constant(y), Constant(zeromean))
    optimize(helper, gradhelper!, initialsol, ConjugateGradient(), opt).minimizer
end
# Negative marginal likelihood function of the GP
function negativemarginallikelihood_gp(p, K, x, y, zeromean)
    N = length(y)
    θ = exp.(p) # make parameters positive
    # Calculate covariance matrix.
    # Better implementations are of course possible.
    for m in 1:N
        for n in 1:N
            K[n, m] = θ[1] * exp(-0.5 * abs2(x[n] - x[m]) / θ[2])
        end
    end
    # Add jitter on the diagonal
    for n in 1:N
        K[n, n] += 1e-6
    end
    # Return the negative log marginal likelihood; this is what we minimise.
    return -logpdf(MvNormal(zeromean, K + θ[3]*I), y)
end
We can run the code with fake data:
x = randn(MersenneTwister(1234), 1000);
y = sin.(x) + 0.01*randn(MersenneTwister(1234), 1000);
gp(x,y; iterations = 1) # warmup
gp(x,y; iterations = 100) # top utility tells me that Julia uses ~8-9% of my 32GB memory.
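(To put a number on this rather than eyeballing top, the total allocations of a run can also be printed with @time after the warmup call:)

@time gp(x, y; iterations = 100) # also reports total allocation count and size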
If I try to run the code with 10_000 data items, like below:
x = randn(MersenneTwister(1234), 10_000);
y = sin.(x) + 0.01*randn(MersenneTwister(1234), 10_000);
gp(x,y; iterations = 100)
Julia will run out of memory and crash.
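For what it's worth, one variant I have in mind (a sketch only; I have not verified whether it changes the memory behaviour) is to fuse the objective and gradient into a single callback with Optim's only_fg! and DifferentiationInterface.value_and_gradient!, replacing the helper, gradhelper! and optimize lines inside gp:

# Sketch: single fused callback (goes inside gp, replacing helper,
# gradhelper! and the optimize call; not verified to reduce allocations).
function fg!(F, G, p)
    if G !== nothing
        # writes the gradient into G and returns (value, G)
        val, _ = DifferentiationInterface.value_and_gradient!(
            negativemarginallikelihood_gp, G, prep, backend, p,
            Cache(K), Constant(x), Constant(y), Constant(zeromean))
        return F !== nothing ? val : nothing
    end
    return negativemarginallikelihood_gp(p, K, x, y, zeromean)
end
optimize(Optim.only_fg!(fg!), initialsol, ConjugateGradient(), opt).minimizer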
My questions:
- Is there something wrong with the above code?
- Alternatively: does anyone have a successful example of optimising a Gaussian process regression model using reverse-mode automatic differentiation?
Update: I should clarify that I am not tied to reverse-mode differentiation and will use anything that does the job. However, the objective I am optimising (in my actual problem, not the MWE above) is a scalar function f: Rᴹ → R, where M is the number of free parameters, typically on the order of 5000 to 15000 depending on the dataset.
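With DifferentiationInterface the backend is a one-line swap, so suggestions along those lines are easy for me to try. A sketch (assuming the corresponding packages are installed; I have not verified either backend on this problem):

import Enzyme, ForwardDiff
backend = AutoEnzyme()      # alternative reverse-mode backend
backend = AutoForwardDiff() # forward mode; likely slow for M ≈ 5000 to 15000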