Robust Mahalanobis distance in high-dimensions with limited samples


I would like to use an iterative algorithm to estimate the parameters of a t-distribution \pi_X characterized by a mean \mu \in \mathbb{R}^n, a PSD scale matrix C \in \mathbb{R}^{Nx \times Nx}, and a degree of freedom \nu>2.

I have access to M i.i.d. samples \{ x^1, \ldots, x^M \} \sim \pi_X. I am usually in the low-data setting where M \leq Nx.

To generate the initial condition, I need to compute the Mahalanobis distance of the samples using the sample mean and covariance as the best estimates for the mean and scale matrix.

In the following example, Nx = 60 and M = 40, the Mahalanobis distance is almost identical for all the samples due to the poor conditioning of the sample scale matrix. This issue doesn’t appear in smaller dimensions.

I would greatly appreciate your help to regularize those estimates. Here is a MWE:

using LinearAlgebra
using Statistics
using PDMats

Nx = 60
νX = 5.0
μX = zeros(Nx)
CX = PDiagMat(ones(Nx))
πX = Distributions.GenericMvTDist(νX, μX, CX)

M = 40
X = rand(πX, M)

μXhat = mean(X; dims = 2)[:,1]
CXhat = cov(X') + 1e-4*I

# Fast implementation
LXhat = cholesky(Positive, CXhat).L
δXhat = sum.(abs2, eachcol(LXhat \ (X .- μXhat)));

# Slow implementation
δXtest = zeros(M)
for i=1:M
    δXtest[i] = (X[:,i] - μXhat)'*inv(CXhat)*(X[:,i] - μXhat)

# We obtain the same results


1×40 adjoint(::Vector{Float64}) with eltype Float64: 38.025 38.025 38.025 38.025 38.025 … 38.025 38.025 38.025 38.025



Maybe some adaptation of the ideas of the graphical lasso would be useful for this?

1 Like

Trying to understand the problem better: There are M samples (of dimension Nx), and the parameters are mu (of dimension Nx) and Cx (of dimension Nx * Nx).
It seems there are more parameters than samples and there is possible overfitting. Specifically, the sample mean and covariance overfit the distribution, giving (as calculated in the post), the same distance for each sample point (which would translate to the same likelihood).
Is this a fair description? If there is overfitting, more constraints (such as sparsity) or a regularizer (such as L1 in lasso) can be used to select one set of parameters over others.

Just to support this overfitting view, the resulting 38.025 distance for each sample, seems to be (M-1)^2/M analytically and independent of Nx (which is logical since the samples are in an M dimensional subspace).

1 Like