Principal Components Analysis on datasets with missing data

Hey all -

I’m fairly new to principal components analysis, especially in Julia. I’ve been digging through @nassarhuda’s great Dimensionality Reduction notebook, but I’m running into an issue because the data I’m working with has a good number of missing values. Here’s what it looks like:

The first problem is with the normalization the notebook recommends. It looks like I can’t do the normalization in the presence of missing values; I just get back a matrix full of missing values.
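Here is roughly what I’m doing, with a small made-up matrix standing in for my real data:

using Statistics

mat = [1.0 2.0; missing 4.0; 5.0 6.0]   # toy matrix with a missing entry

# Column-wise standardization as in the notebook: the missing value propagates
# through mean and std, so every column containing a missing value comes back
# entirely missing.
norm_mat = (mat .- mean(mat, dims=1)) ./ std(mat, dims=1)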

Obviously, this is not what I need. But I can drop the rows with missing values and run the same normalization on what remains:
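Something along these lines, continuing from the snippet above:

# Keep only the rows with no missing values, then standardize those.
complete_idx = [i for i in 1:size(mat, 1) if !any(ismissing, mat[i, :])]
complete = Matrix{Float64}(mat[complete_idx, :])

norm_complete = (complete .- mean(complete, dims=1)) ./ std(complete, dims=1)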

I then have no trouble running the PCA, but I am not sure how to match the result back onto my original dataframe, since it doesn’t have indices or the same dimensions as the original dataset. I also would love to be able to include observations that have missing values in only some columns. Is that possible?

Any guidance would be greatly appreciated!

PCA with missing data is nontrivial; you have to make some assumptions.

Personally, I would recommend implementing Bayesian PCA; you will find many sources online. But there are other alternatives as well.

2 Likes

https://github.com/madeleineudell/LowRankModels.jl
handles PCA with missing data
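For example, something along these lines should work. This is a rough sketch from memory, with made-up data and placeholder names, so check the package README for the exact GLRM constructor and keyword arguments:

using LowRankModels

A = [1.0 2.0 3.0; missing 5.0 6.0; 7.0 8.0 missing; 2.0 1.0 4.0]   # toy data with missing entries
k = 2                                                              # number of components to keep

obs = [(i, j) for i in 1:size(A, 1), j in 1:size(A, 2) if !ismissing(A[i, j])]
A0 = coalesce.(A, 0.0)   # placeholder values; entries outside obs are ignored in the fit

# A quadratic loss with no regularization, fit only on the observed entries,
# recovers PCA while tolerating the missing data.
glrm = GLRM(A0, QuadLoss(), ZeroReg(), ZeroReg(), k, obs = obs)
X, Y, ch = fit!(glrm)    # A is approximated by X'Y

Because every row of A is kept (some entries are just unobserved), the columns of X line up with the rows of the original data, so there is nothing to match back afterwards, if I remember the conventions right.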

8 Likes

Somehow I had never seen this package. It seems very well made - thanks for this.

1 Like

There is the function skipmissings in Missings.jl.

using Missings
t = skipmissings(eachcol(mat)...)

then you can iterate through the indices of t to determine which rows in the data don’t have missing observations.

No guarantees about the speed of this approach, but if you are only doing it once it should be okay.

EDIT: t is a tuple, if that isn’t clear from running the code. You want the indices of the first element of t, which is an iterator.
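Concretely, something like this should give you the rows to keep. I believe eachindex works on these wrapped iterators the same way it does on Base.skipmissing, but double-check against the Missings.jl docs:

# Row indices where no column has a missing value.
complete_rows = collect(eachindex(first(t)))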

1 Like

It is quite useful. Unfortunately, it has historically been maintained only sporadically, and many people have kept their own forks of it as a workaround. I noticed a large PR was merged recently to brush it up; hopefully it will stay on track in the future :slight_smile:

3 Likes