Hello everyone,
I am working on a model called the Gaussian process latent variable model (GPLVM). The GPLVM is a dimensionality reduction method which, given a high-dimensional dataset Y, returns low-dimensional projections X.
To optimise the free parameters, I use Optim.jl in conjunction with DifferentiationInterface.jl and Mooncake.jl, which calculate the gradients for me automatically. While I am happy with this setup, I have noticed very high memory consumption: the top utility reports that more than 20% of my 32GB of memory is in use, and that is for only a modest number of data items. In principle I don't mind this, but if I increase the number of data items, even more memory is consumed and the system eventually kills my Julia session.
My actual code is quite lengthy. For the purposes of this question, I have cut it down significantly, but it still exceeds the length of a typical MWE. The code is organised in three functions:
- function `gplvm` is called by the user. It sets up the optimiser options, sets up the gradient, calls the optimiser and finally returns the low-dimensional coordinates `X` inferred by the GPLVM model.
- function `negativemarginallikelihood` is the objective function to be minimised with respect to the coordinates `X`, the parameters `θ` of the Gaussian process kernel and the noise variance `σ²`. If you are not familiar with the GPLVM, but know about Gaussian processes, then this resembles very strongly the negative marginal log-likelihood of a regular Gaussian process regression model.
- function `unpack_gplvm` is an auxiliary function that converts the vector of free, unconstrained parameters `p` into the parameters `X`, `θ` and `σ²`.
using DifferentiationInterface
using Distances
using Distributions
using LinearAlgebra
import Mooncake
using Optim
using Random
"""
Y are the D×N high-dimensional data points
iterations is the number of iterations of the optimisation algorithm
Q is the dimensionality of the latent space
"""
function gplvm(Y; iterations = 1, Q = 2)
# Get data dimensionality D and number of data items N
D, N = size(Y)
# Allocate once zero vector necessary for marginal likelihood
# calculation of zero-mean Gaussian process
zerovector = zeros(N)
# Initialise parameters randomly:
# first Q*N elements are the N latent Q-dimensional projections X
# next 2 elements are kernel parameters - take log here because unpack function uses exp to ensure positivity
# last parameter is the noise variance - take log here because unpack function uses exp to ensure positivity
rng = MersenneTwister(1234)
initialsol = [randn(rng, Q*N)*0.1; log(1.0); log(1.0); log(1.0)]
# pre-allocate N×N covariance matrix K
K = zeros(N, N)
# setup optimiser options
opt = Optim.Options(iterations = iterations, show_trace = true, show_every = 1)
# use DifferentiationInterface to get gradients
# This block uses Mooncake. To use Enzyme instead, comment out this block and comment in the Enzyme block below.
backend = AutoMooncake(config = nothing)
prep = prepare_gradient(negativemarginallikelihood, backend, initialsol, Constant(Y), Constant(zerovector), Cache(K), Constant(D), Constant(Q), Constant(N))
gradhelper!(grad, p) = DifferentiationInterface.gradient!(negativemarginallikelihood, grad, prep, backend, p, Constant(Y), Constant(zerovector), Cache(K), Constant(D), Constant(Q), Constant(N))
helper(p) = negativemarginallikelihood(p, Y, zerovector, K, D, Q, N)
# To use Enzyme, comment in the lines below (and add import Enzyme at the top of the file) and comment out the Mooncake block above
# backend = AutoEnzyme()
# helper(p) = negativemarginallikelihood(p, Y, zerovector, K, D, Q, N)
# prep = prepare_gradient(helper, backend, initialsol)
# gradhelper!(grad, p) = DifferentiationInterface.gradient!(helper, grad, prep, backend, p)
# call actual optimisation
finalsolution = optimize(helper, gradhelper!, initialsol, ConjugateGradient(), opt).minimizer
# obtain optimised latent
X = unpack_gplvm(finalsolution, Q, N)[1]
# return projections
return X
end
# Negative marginal likelihood function of GPLVM.
# We want to minimise this.
function negativemarginallikelihood(p, Y, zerovector, K, D, Q, N)
# extract parameters from vector p
X, θ, σ² = unpack_gplvm(p, Q, N)
# calculate pairwise squared Euclidean distances.
# Obviously, a more efficient implementation is possible (see the pairwise sketch after this function).
for n in 1:N
for m in 1:N
@views K[n, m] = sum((X[:, n] - X[:, m]).^2)
end
end
# overwrite K entries with covariance matrix elements
for n in eachindex(K)
K[n] = θ[1] * exp(-0.5 * K[n] / θ[2])
end
# add jitter on diagonal
for n in 1:N
K[n, n] += 1e-6
end
# accumulate here the log likelihood over the D dimensions
accloglikel = zero(eltype(p))
# instantiate multivariate normal distribution
mvn = MvNormal(zerovector, K + σ²*I)
# iterate over D dimensions
for d in 1:D
# calculate log likelihood of d-th dimension
accloglikel += @views logpdf(mvn, Y[d, :])
end
# return negative log marginal likelihood
-1.0 * accloglikel
end
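As an aside, by "a more efficient implementation is possible" above I mean something along the lines of the sketch below, which computes all pairwise squared Euclidean distances between columns with pairwise from Distances.jl (already imported at the top); an in-place pairwise! variant also exists for filling the preallocated K. I have not checked whether Mooncake handles this any better than the explicit double loop, so treat it purely as a sketch with made-up toy sizes:
Xtoy = randn(2, 5)                             # toy Q×N matrix of latent coordinates
Ktoy = pairwise(SqEuclidean(), Xtoy, dims = 2) # 5×5 matrix of squared distances between columns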
# Given parameters flattened in p, unpack them into X, θ and σ²
function unpack_gplvm(p, Q, N)
MARK = 0
# First Q*N elements are the N latent Q-dimensional projections X
X = reshape(p[MARK+1:MARK+Q*N], Q, N); MARK += Q*N
# Next two elements are kernel parameters
θ = exp.(p[MARK+1:MARK+2]); MARK += 2
# The last parameter is the noise variance
σ² = exp(p[MARK+1]); MARK += 1
return X, θ, σ²
end
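To make the packing convention concrete, here is a small sanity check (not part of my actual code; toy sizes and arbitrary values) showing how the flat vector maps back to X, θ and σ²:
Qtoy, Ntoy = 2, 3
ptoy = [randn(Qtoy*Ntoy); log(1.0); log(2.0); log(0.5)]
Xtoy, θtoy, σ²toy = unpack_gplvm(ptoy, Qtoy, Ntoy)
@assert size(Xtoy) == (Qtoy, Ntoy)  # N latent Q-dimensional projections
@assert θtoy ≈ [1.0, 2.0]           # kernel parameters recovered via exp
@assert σ²toy ≈ 0.5                 # noise variance recovered via exp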
To execute the code and observe the high memory consumption, we can simply call it with randomly generated data:
Y = randn(12, 1000) # 1000 data items with 12 features
X = gplvm(Y; iterations = 1) # warmup
X = gplvm(Y; iterations = 100) # note memory consumption during execution
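In case it is useful: besides watching top, the numbers can also be inspected from within Julia, since @time prints the total allocations of a call and Sys.maxrss() returns the peak resident set size of the session in bytes (both from Base, as far as I know). A sketch:
@time X = gplvm(Y; iterations = 100)  # reports time and total allocations
println("peak RSS: ", round(Sys.maxrss() / 2^30, digits = 2), " GiB")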
If I run this:
Y = randn(12, 10_000) # 10000 data items with 12 features
X = gplvm(Y; iterations = 100)
the system will kill the julia session.
I strongly believe that the high memory consumption is related to the use of DifferentiationInterface.jl and Mooncake.jl. I thought that the use of contexts (Constant and Cache) would help, but unfortunately it didn't. I have gone through the documentation, but I can't tell whether I am missing something. Does anyone have any advice on how I could reduce my memory footprint? Thanks for tolerating this very long question.
Perhaps useful: I am using Julia Version 1.11.4 on Ubuntu 22.04.5 LTS.
Update: updated the code so that Enzyme can be commented in and Mooncake commented out. This is in response to a comment below.
Update: introduced a new function that uses the gradient-free NelderMead optimiser:
"""
Same as gplvm but uses the gradient-free NelderMead optimiser.
"""
function gplvm_gradient_free(Y; iterations = 1, Q = 2)
# Get data dimensionality D and number of data items N
D, N = size(Y)
# Allocate once zero vector necessary for marginal likelihood
# calculation of zero-mean Gaussian process
zerovector = zeros(N)
# Initialise parameters randomly:
# first Q*N elements are the N latent Q-dimensional projections X
# next 2 elements are kernel parameters - take log here because unpack function uses exp to ensure positivity
# last parameter is the noise variance - take log here because unpack function uses exp to ensure positivity
rng = MersenneTwister(1234)
initialsol = [randn(rng, Q*N)*0.1; log(1.0); log(1.0); log(1.0)]
# pre-allocate N×N covariance matrix K
K = zeros(N, N)
# setup optimiser options
opt = Optim.Options(iterations = iterations, show_trace = true, show_every = 1)
# objective function to be optimised
helper(p) = negativemarginallikelihood(p, Y, zerovector, K, D, Q, N)
# call actual optimisation
finalsolution = optimize(helper, initialsol, NelderMead(), opt).minimizer
# obtain optimised latent
X = unpack_gplvm(finalsolution, Q, N)[1]
# return projections
return X
end
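For completeness, the gradient-free version is called in the same way as gplvm above, for example:
Y = randn(12, 1000) # 1000 data items with 12 features
X = gplvm_gradient_free(Y; iterations = 100)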