PCA Output?

Gunter_Faes · April 3, 2019, 11:13am

Thanks to your support I can now estimate my PCA model and get the following output (MultivariateStats):

PCA(indim = 1599, outdim = 3, principalratio = 0.99857)

The methods offered by PCA seem a little rudimentary to me. I’m just trying to explore the advantages of Julia over R and get in R this output:

> PCA_Modell <- principal(Rotwein_o_Q, 4)
> print(PCA_Modell, cut = 0.5, sort = TRUE, digits = 2)
Principal Components Analysis
Call: principal(r = Rotwein_o_Q, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
                     item   RC1   RC2   RC3   RC4   h2   u2 com
fixed.acidity           1  0.91                   0.85 0.15 1.1
density                 8  0.78                   0.80 0.20 1.6
citric.acid             3  0.75                   0.81 0.19 1.9
pH                      9 -0.73                   0.60 0.40 1.2
free.sulfur.dioxide     6        0.88             0.80 0.20 1.1
total.sulfur.dioxide    7        0.88             0.79 0.21 1.1
residual.sugar          4                         0.39 0.61 2.5
alcohol                11              0.78       0.69 0.31 1.3
volatile.acidity        2             -0.72       0.64 0.36 1.5
chlorides               5                    0.80 0.73 0.27 1.3
sulphates              10                    0.76 0.68 0.32 1.4

                       RC1  RC2  RC3  RC4
SS loadings           2.87 1.80 1.67 1.44
Proportion Var        0.26 0.16 0.15 0.13
Cumulative Var        0.26 0.42 0.58 0.71
Proportion Explained  0.37 0.23 0.21 0.19
Cumulative Proportion 0.37 0.60 0.81 1.00

Mean item complexity =  1.4
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09 
 with the empirical chi square  1473  with prob <  3.2e-303

Does Julia offer a similar edition? Because the loadings, SS loadings and Cumulative Var (and more information about the model) are important for model evaluation.

Thank you for your support and
best regards,
Günter

nilshg · April 3, 2019, 2:24pm

Not a PCA expert so unsure what exactly R is spitting out and how it relates to the output of Julia’s PCA, but there are

projection(PCA_modell)
principalvars(PCA_modell)
tresidualvar(PCA_modell)

according to the docs which seem to be related. I’m assuming you’re interested in this for some sort of model selection rather than in the print output itself - if this is actually about the printed output then you could look into defining a Base.show() method for the PCA type.

Gunter_Faes · April 3, 2019, 6:43pm

Thank you for yor reply!

I know the mentioned PCA methods, but they are far away from really speaking of a PCA.

As a data analyst, I need to see more or less at a glance what the loadings of each variable on the main components are like. Just as the R function principal does. Also, I have no idea (I can’t see anything in the docu) how to pass the number of possible principal components and the rotation as arguments to the PCA function.

At the moment I don’t see any comparison to the R-function, but hope that I’m wrong…!

aharoun · April 4, 2019, 1:36am

It looks like R function you are using comes from a package called psych and it does a combination of PCA and factor analysis. On the other hand fit(PCA, x,...) of MultivariateStats.jl does just PCA. Indeed if you check out prcomp or princomp, which are the main PCA implementations in R, they are very similar to Julia function in terms of what they output.

Output of fit(PCA, x, ...) may not look similar to that of principal in R at first glance, but they contain more or less the same information. Consider this example in R:

> pc <- principal(cov(iris[1:4]),4,rotate="none")
> pc
Principal Components Analysis
Call: principal(r = cov(iris[1:4]), nfactors = 4, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
               PC1  PC2   PC3   PC4 h2       u2 com
Sepal.Length  0.89 0.36 -0.28 -0.04  1  4.4e-16 1.5
Sepal.Width  -0.46 0.88  0.09  0.02  1 -1.1e-15 1.5
Petal.Length  0.99 0.02  0.05  0.12  1 -1.0e-15 1.0
Petal.Width   0.96 0.06  0.24 -0.08  1 -1.1e-15 1.1

                       PC1  PC2  PC3  PC4
SS loadings           2.92 0.91 0.15 0.02
Proportion Var        0.73 0.23 0.04 0.01
Cumulative Var        0.73 0.96 0.99 1.00
Proportion Explained  0.73 0.23 0.04 0.01
Cumulative Proportion 0.73 0.96 0.99 1.00

The same analysis can be done in Julia:

julia> using MultivariateStats, RDatasets, StatsBase

julia> data = Matrix{Float64}(iris[1:4]);

julia> data = (data .- mean(data,dims = 1))./ std(data,dims=1); # scaled

julia> p = fit(PCA,data',maxoutdim=4,pratio=.999)
PCA(indim = 4, outdim = 4, principalratio = 1.00000)

julia> projection(p)
4×4 Array{Float64,2}:
 -0.521066  0.377418   -0.719566   0.261286
  0.269347  0.923296    0.244382  -0.12351 
 -0.580413  0.0244916   0.142126  -0.801449
 -0.564857  0.066942    0.634273   0.523597

julia> principalvars(p)
4-element Array{Float64,1}:
 2.9184978165319926 
 0.9140304714680704 
 0.14675687557131467
 0.02071483642861875

Notice that “SS loadings” from R is same as principalvars(p) in Juia, and “Loadings” from R is just a scaled version of projection(p) (multiply each column of projection with the corresponding principalvars^0.5).

Gunter_Faes · April 4, 2019, 6:54am

Thanks for the answer!

I quickly compared the issue with R & Iris and they agree. The output still needs a little “polishing” but for the beginning that’s ok. As I noticed at the beginning, I know that Julia offers everything about PCA, but the preparation and postprocessing is not very user-friendly.

nilshg · April 4, 2019, 9:18am

As you seem to be concerned about “polished” output and having output available “at a glance” you might want to look into overloading Base.show() for the PCA type to get the ouptut you want, e.g.:

julia> using MultivariateStats
julia> model = fit(PCA, rand(1000,10))

julia> Base.show(io::IO, x::PCA{Float64}) = begin
    println("Variance of principle components: ",x.prinvars)
    println("Total observation variance: ",x.tvar)
    println("Total residual variance :",tresidualvar(model))
end

julia> model
Variance of principle components: [9.92461, 9.61066, 9.20797, 8.44422, 8.11305, 7.85349, 7.58558, 7.07091, 6.88697]
Total observation variance: 74.69745337859435
Total residual variance :0.0

(Note that I haven’t thought at all about whether that’s useful information to display, it’s just to give you a general idea of what you could do - lots more info in the I/O and Network docs)

Gunter_Faes · April 4, 2019, 9:23am

Hello Nils (right?),

what can I say? Thanks for your support, I will try that!

Tamas_Papp · April 4, 2019, 11:08am

Which makes this a perfect opportunity for a PR — having resolved the issue, you could pass on the help you got to future users.

(Also, strictly speaking, it is not “Julia”, it is a Julia package.)

aharoun · April 4, 2019, 2:51pm

I think it is more related to the focus of the packages. R package psych is designed mainly around the practices in psychological research, where interpreting the factors and loadings is common practice. Therefore it presents its output in a certain way to facilitate this and includes functionalities like rotation. On the other hand, Julian MultivariateStats seems to be a more traditional machine learning package where the core practices and the focus are different. In this regard, it is more similar to R functions like prcomp and princomp.

Gunter_Faes · April 4, 2019, 5:36pm

This may all be correct, but in the end psych delivers a well interpretable output and that should be the essence from the user’s point of view.

The core of my solution (for learning purposes) is this (PCA_Data contains the observations, PCA_HK the estimated number of main components):

PCA_Modell = R"principal($PCA_Daten, $PCA_HK)"
PCA_scores =  R"principal($PCA_Daten, $PCA_HK)$scores"
show(PCA_Modell)

As you can see, I use R over RCall. And that is the output:

RObject{VecSxp}
Principal Components Analysis
Call: principal(r = `#JL`$PCA_Daten, nfactors = `#JL`$PCA_HK)
Standardized loadings (pattern matrix) based upon correlation matrix
     RC1   RC2   RC3   RC4   h2   u2 com
1   0.91 -0.13  0.07  0.02 0.85 0.15 1.1
2  -0.34 -0.02 -0.72 -0.13 0.64 0.36 1.5
3   0.75  0.04  0.44  0.23 0.81 0.19 1.9
4   0.35  0.46 -0.03 -0.25 0.39 0.61 2.5
5   0.13  0.00 -0.27  0.80 0.73 0.27 1.3
6  -0.14  0.88  0.03  0.07 0.80 0.20 1.1
7  -0.02  0.88 -0.11  0.09 0.79 0.21 1.1
8   0.78  0.10 -0.44  0.01 0.80 0.20 1.6
9  -0.73  0.01 -0.01 -0.25 0.60 0.40 1.2
10  0.15  0.06  0.27  0.76 0.68 0.32 1.4
11 -0.23 -0.11  0.78 -0.13 0.69 0.31 1.3

                       RC1  RC2  RC3  RC4
SS loadings           2.87 1.80 1.67 1.44
Proportion Var        0.26 0.16 0.15 0.13
Cumulative Var        0.26 0.42 0.58 0.71
Proportion Explained  0.37 0.23 0.21 0.19
Cumulative Proportion 0.37 0.60 0.81 1.00

Mean item complexity =  1.4
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09
 with the empirical chi square  1473.01  with prob <  3.2e-303

Fit based upon off diagonal values = 0.89

The output is easy to read and contains some information on the quality of model estimation. This is basic information for the data analyst.

The value of the latent variables (scores) is also supplied simply (well, Julia (or the function of the package) also supplies this). And by the way, the scree-graphic is delivered as well.

The effort I have to make as a user is quite small. Speaks from my point of view for the R-function. I still see potential for the MultivariateStats package.

Thank you for your support and greetings,
Günter

kevbonham · April 4, 2019, 6:31pm

Frankly, I think the outputs from R functions like this are way too verbose - most of the time I want a small fraction of what displayed and have to go hunting. Or I want to pass some data to a different function and the r package thinks the display is the endpoint without giving me easy access to the data. If this kind of display is useful for a Data analyst, maybe make a DataAnalystUtils package that generates nice displays for various models. I’m sure you’re not alone in wanting this, but I wouldn’t want it to be part of the core package.

Edit:

Also would be great to include some plot recipes like this in such a utility package.

Zach_Christensen · April 4, 2019, 10:11pm

I was looking for this exact feature two days ago. I agree that PCA interpretation should get a lot of attention in terms of plots and summary text somewhere.

nilshg · April 5, 2019, 3:31pm

None of this is very hard to implement though as was pointed out in this thread - overloading Base.show and implementing a Plots.jl recipe would do the trick; if you don’t feel comfortable doing this yourself maybe try opening an issue on the Multivariatestats GitHub, the maintainers might well be up for feature requests!

Gunter_Faes · April 5, 2019, 4:05pm

Interesting thoughts. I can understand that some R functions overwhelm the user with a lot of information. I consider the psych package an exception and it is therefore very popular. However, “hunting” for the right information is often very well supported by the $-operator. Here is an example: The loadings are explicitly output via Model$loadings. Usually very simple.

As far as the development of a package like the DataAnalystUtils is concerned, I am not yet experienced enough. For this I need much more experience. I think the idea is good and maybe you can give me a hint to keep the learning curve low?

kevbonham · April 5, 2019, 6:29pm

I don’t think this is true! Making a package is a great way to learn.

You could check out my MicrobiomePlots.jl for an example of how to make a plotting recipe (I made one for a PCoA scatter here) and others have mentioned how you could make show methods. Technically that would be type piracy I think, but you could just make new displaymodel functions or something.

There are lots of other examples too. Just copy someone and get started

Gunter_Faes · April 5, 2019, 6:51pm

Oh dear, that’s what I get for it! Thanks and I’ll have a look! But I know, sooner or later you can’t get around the package development…

Zach_Christensen · April 5, 2019, 7:11pm

Would StatsPlots be a more appropriate place for the plotting features discussed here?

Note: I ask because I need to make a really great PCA plot in the next couple days so I may as well contribute my work somewhere

kevbonham · April 5, 2019, 8:34pm

Probably, yes. I mean, IMO the best place for it would really be in the package that creates the type, but there have been several discussions and this is not the generally accepted view. Or, at least it’s not the view of the devs of several of the stats packages who have final say on the matter.

nalimilan · April 5, 2019, 8:51pm

I think a PR adding a more detailed output for PCA in MultivariateStats would be well received (unless it’s really verbose). We already print a lot of information (similar to R) in GLM for example.

StatsPlots sounds like the appropriate place where to define plot recipes for PCA.

Zach_Christensen · April 5, 2019, 10:17pm

Perhaps a nice compromise would be putting various options for details in show with low verbosity as the default (eg. show(..., verbose=false)). Personally, I have historically used the psych package and really liked the Scree plot and read out. I don’t think either of them were especially unique to psychology.

I think if there’s any additional manipulation to the data (e.g. unique projections of components) then a separate package would make sense, but creating additional packages for displaying information that’s already there seems odd to me.

Topic		Replies	Views
Dimensionality Reduction Packages in Julia Optimization (Mathematical)	27	4974	October 24, 2021
Problem realization of PCA Optimization (Mathematical) question	2	554	February 4, 2022
A question ahout PCA Optimization (Mathematical) question	9	1806	June 29, 2020
Package for KPCA Statistics	3	496	April 29, 2023
PCA in MultivariateStats Statistics	0	1535	November 14, 2016

PCA Output?

Related topics