Principal Components Analysis on datasets with missing data

Hey all -

I’m fairly new to principal components analysis, especially in Julia. I’ve been digging through @nassarhuda’s great Dimensionality Reduction notebook, but I’m running into an issue because the data I’m working with has a good number of missing values. Here’s what it looks like:

The first problem is with the normalization the notebook recommends. It looks like I can’t do the normalization in the presence of missing values; I just get back a matrix full of missing values.
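Here is roughly what I’m doing, with a small made-up matrix standing in for my real data:

using Statistics

mat = [1.0 2.0; missing 4.0; 5.0 6.0]   # toy matrix with a missing entry

# Column-wise standardization as in the notebook: the missing value propagates
# through mean and std, so every column containing a missing value comes back
# entirely missing.
norm_mat = (mat .- mean(mat, dims=1)) ./ std(mat, dims=1)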

Obviously, this is not what I need. But I can drop the rows with missing values and run the same normalization on what remains:
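Something along these lines, continuing from the snippet above:

# Keep only the rows with no missing values, then standardize those.
complete_idx = [i for i in 1:size(mat, 1) if !any(ismissing, mat[i, :])]
complete = Matrix{Float64}(mat[complete_idx, :])

norm_complete = (complete .- mean(complete, dims=1)) ./ std(complete, dims=1)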

I then have no trouble running the PCA, but I am not sure how to match the result back onto my original dataframe, since it doesn’t have indices or the same dimensions as the original dataset. I also would love to be able to include observations that have missing values in only some columns. Is that possible?

Any guidance would be greatly appreciated!

PCA with missing data is nontrivial; you have to make some assumptions.

Personally, I would recommend implementing Bayesian PCA; you will find many sources online. But there are other alternatives as well.

2 Likes

https://github.com/madeleineudell/LowRankModels.jl
handles PCA with missing data
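For example, something along these lines should work. This is a rough sketch from memory, with made-up data and placeholder names, so check the package README for the exact GLRM constructor and keyword arguments:

using LowRankModels

A = [1.0 2.0 3.0; missing 5.0 6.0; 7.0 8.0 missing; 2.0 1.0 4.0]   # toy data with missing entries
k = 2                                                              # number of components to keep

obs = [(i, j) for i in 1:size(A, 1), j in 1:size(A, 2) if !ismissing(A[i, j])]
A0 = coalesce.(A, 0.0)   # placeholder values; entries outside obs are ignored in the fit

# A quadratic loss with no regularization, fit only on the observed entries,
# recovers PCA while tolerating the missing data.
glrm = GLRM(A0, QuadLoss(), ZeroReg(), ZeroReg(), k, obs = obs)
X, Y, ch = fit!(glrm)    # A is approximated by X'Y

Because every row of A is kept (some entries are just unobserved), the columns of X line up with the rows of the original data, so there is nothing to match back afterwards, if I remember the conventions right.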

8 Likes

Somehow I had never seen this package. It seems very well made - thanks for this.

1 Like

There is the function skipmissings in Missings.jl.

using Missings
t = skipmissings(eachcol(mat)...)

then you can iterate through the indices of t to determine which rows in the data don’t have missing observations.

No guarantees about the speed of this approach, but if you are only doing it once it should be okay.

EDIT: t is a tuple, if that isn’t clear from running the code. You want the indices of the first element of t, which is an iterator.
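Concretely, something like this should give you the rows to keep. I believe eachindex works on these wrapped iterators the same way it does on Base.skipmissing, but double-check against the Missings.jl docs:

# Row indices where no column has a missing value.
complete_rows = collect(eachindex(first(t)))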

1 Like

It is quite useful. Unfortunately, it has historically been maintained only sporadically, and many people have kept their own forks of it as a workaround. I noticed a large PR was merged recently to brush it up; hopefully it will stay on track in the future :slight_smile:

3 Likes