I have a matrix of absorbance readings (one column per item) at different wavenumbers (rows). I want to enlarge my dataset with more synthetic data, and I am trying to fit an MvNormal to my matrix -as opposed to fitting a univariate Normal to each wavenumber separately. I am getting the following error:
Distributions.fit(MvNormal, myMatrix)
> ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.
My matrix can’t be positive definite because it’s not a square matrix - the number of wavenumbers I am considering is not the same as the number of items I have scanned. Does this mean that I simply can’t fit a multivariate normal distribution to my data?
I’m pretty sure the columns are okay (I tend to double check these things many times). I also have fewer observations than wavelengths, which is the reason why I want to generate an extension of the dataset
You have more variables than observations so you’re hitting a problem with Distributions.jl: it only supports covariance matrices that are positive definite.
In your case the observation vectors are obviously linearly dependent so the covariance matrix is positive semidefinite. This is easy to reproduce:
julia> MvNormal(cov(randn(3, 5)))
ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.
although as I understand that’s quite a hack: PSDMat is a subtype of AbstractPDMat so it pretends the matrix is positive definite. I don’t know if that can cause problems elsewhere…
I am no expert in linear algebra so I’m not entirely sure of the implications of PSDMating my covariate matrix to allow its use for the construction of an MvNormal, but it worked and I am now able to draw samples from that distribution that resemble real samples quite well.
Only one note: when doing cov(myMatrix), it is necessary to specify cov(myMatrix, dims=2).
Just to make sure I understand what is going on:
Fitting a Normal distribution to a vector of numbers is equivalent to constructing an N(mu, sigma^2) where mu is the mean of the vector and sigma^2 is its variance. Translating this to the present case of fitting a multivariate normal to my myMatrix (size pxn); Distributions.fit(MvNormal, myMatrix) is equivalent to MvNormal(mu, sigma) where mu is a vector (px1) containing the mean of each of its rows (variables) and sigma is its covariance matrix along the second dimension (size pxp). In order to allow this covariance matrix to be used to construct an MvNormal, it has to be of type PSDMat.
This is the final working code:
using PDMatsExtras
μ = mean(myMatrix, dims = 2) |> vec
σ = PSDMat(cov(myMatrix, dims = 2))
multiNormal = MvNormal(μ, σ)
new = rand(multiNormal, 100) # draw 100 elements from this distribution
plot(new[36]) #plot some random element to check visually that it resembles the original data
Yes that’s right! A few details: dims=2 is required because your observations are along the second dimension; a vector in Julia is not the same as a px1 matrix (that’s why you need the vec in your example); the PSDMat type is required only if the matrix happens to be positive semidefinite; and while we’re in the details: the usual symbol for the covariance matrix is an uppercase sigma: Σ.
Also I think to get the exactly the same result as fit you would need to call cov with corrected=false (not sure if that’s intended behavior or a bug in fit).