# PCA Output?

Thanks to your support I can now estimate my PCA model and get the following output (MultivariateStats):

``````PCA(indim = 1599, outdim = 3, principalratio = 0.99857)
``````

The methods offered by PCA seem a little rudimentary to me. I’m just trying to explore the advantages of Julia over R and get in R this output:

``````> PCA_Modell <- principal(Rotwein_o_Q, 4)
> print(PCA_Modell, cut = 0.5, sort = TRUE, digits = 2)
Principal Components Analysis
Call: principal(r = Rotwein_o_Q, nfactors = 4)
item   RC1   RC2   RC3   RC4   h2   u2 com
fixed.acidity           1  0.91                   0.85 0.15 1.1
density                 8  0.78                   0.80 0.20 1.6
citric.acid             3  0.75                   0.81 0.19 1.9
pH                      9 -0.73                   0.60 0.40 1.2
free.sulfur.dioxide     6        0.88             0.80 0.20 1.1
total.sulfur.dioxide    7        0.88             0.79 0.21 1.1
residual.sugar          4                         0.39 0.61 2.5
alcohol                11              0.78       0.69 0.31 1.3
volatile.acidity        2             -0.72       0.64 0.36 1.5
chlorides               5                    0.80 0.73 0.27 1.3
sulphates              10                    0.76 0.68 0.32 1.4

RC1  RC2  RC3  RC4
Proportion Var        0.26 0.16 0.15 0.13
Cumulative Var        0.26 0.42 0.58 0.71
Proportion Explained  0.37 0.23 0.21 0.19
Cumulative Proportion 0.37 0.60 0.81 1.00

Mean item complexity =  1.4
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09
with the empirical chi square  1473  with prob <  3.2e-303

``````

Thank you for your support and
best regards,
Günter

Not a PCA expert so unsure what exactly R is spitting out and how it relates to the output of Julia’s PCA, but there are

``````projection(PCA_modell)
principalvars(PCA_modell)
tresidualvar(PCA_modell)
``````

according to the docs which seem to be related. I’m assuming you’re interested in this for some sort of model selection rather than in the print output itself - if this is actually about the printed output then you could look into defining a `Base.show()` method for the `PCA` type.

1 Like

I know the mentioned PCA methods, but they are far away from really speaking of a PCA.

As a data analyst, I need to see more or less at a glance what the loadings of each variable on the main components are like. Just as the R function principal does. Also, I have no idea (I can’t see anything in the docu) how to pass the number of possible principal components and the rotation as arguments to the PCA function.

At the moment I don’t see any comparison to the R-function, but hope that I’m wrong…!

1 Like

It looks like R function you are using comes from a package called `psych` and it does a combination of PCA and factor analysis. On the other hand `fit(PCA, x,...)` of `MultivariateStats.jl` does just PCA. Indeed if you check out `prcomp` or `princomp`, which are the main PCA implementations in R, they are very similar to Julia function in terms of what they output.

Output of `fit(PCA, x, ...)` may not look similar to that of `principal` in R at first glance, but they contain more or less the same information. Consider this example in R:

``````> pc <- principal(cov(iris[1:4]),4,rotate="none")
> pc
Principal Components Analysis
Call: principal(r = cov(iris[1:4]), nfactors = 4, rotate = "none")
PC1  PC2   PC3   PC4 h2       u2 com
Sepal.Length  0.89 0.36 -0.28 -0.04  1  4.4e-16 1.5
Sepal.Width  -0.46 0.88  0.09  0.02  1 -1.1e-15 1.5
Petal.Length  0.99 0.02  0.05  0.12  1 -1.0e-15 1.0
Petal.Width   0.96 0.06  0.24 -0.08  1 -1.1e-15 1.1

PC1  PC2  PC3  PC4
Proportion Var        0.73 0.23 0.04 0.01
Cumulative Var        0.73 0.96 0.99 1.00
Proportion Explained  0.73 0.23 0.04 0.01
Cumulative Proportion 0.73 0.96 0.99 1.00
``````

The same analysis can be done in Julia:

``````julia> using MultivariateStats, RDatasets, StatsBase

julia> data = Matrix{Float64}(iris[1:4]);

julia> data = (data .- mean(data,dims = 1))./ std(data,dims=1); # scaled

julia> p = fit(PCA,data',maxoutdim=4,pratio=.999)
PCA(indim = 4, outdim = 4, principalratio = 1.00000)

julia> projection(p)
4×4 Array{Float64,2}:
-0.521066  0.377418   -0.719566   0.261286
0.269347  0.923296    0.244382  -0.12351
-0.580413  0.0244916   0.142126  -0.801449
-0.564857  0.066942    0.634273   0.523597

julia> principalvars(p)
4-element Array{Float64,1}:
2.9184978165319926
0.9140304714680704
0.14675687557131467
0.02071483642861875
``````

Notice that “SS loadings” from R is same as `principalvars(p)` in Juia, and “Loadings” from R is just a scaled version of `projection(p)` (multiply each column of `projection` with the corresponding `principalvars^0.5`).

8 Likes

I quickly compared the issue with R & Iris and they agree. The output still needs a little “polishing” but for the beginning that’s ok. As I noticed at the beginning, I know that Julia offers everything about PCA, but the preparation and postprocessing is not very user-friendly.

1 Like

As you seem to be concerned about “polished” output and having output available “at a glance” you might want to look into overloading `Base.show()` for the `PCA` type to get the ouptut you want, e.g.:

``````julia> using MultivariateStats
julia> model = fit(PCA, rand(1000,10))

julia> Base.show(io::IO, x::PCA{Float64}) = begin
println("Variance of principle components: ",x.prinvars)
println("Total observation variance: ",x.tvar)
println("Total residual variance :",tresidualvar(model))
end

julia> model
Variance of principle components: [9.92461, 9.61066, 9.20797, 8.44422, 8.11305, 7.85349, 7.58558, 7.07091, 6.88697]
Total observation variance: 74.69745337859435
Total residual variance :0.0
``````

(Note that I haven’t thought at all about whether that’s useful information to display, it’s just to give you a general idea of what you could do - lots more info in the I/O and Network docs)

4 Likes

Hello Nils (right?),

what can I say? Thanks for your support, I will try that!

1 Like

Which makes this a perfect opportunity for a PR — having resolved the issue, you could pass on the help you got to future users.

(Also, strictly speaking, it is not “Julia”, it is a Julia package.)

1 Like

I think it is more related to the focus of the packages. R package `psych` is designed mainly around the practices in psychological research, where interpreting the factors and loadings is common practice. Therefore it presents its output in a certain way to facilitate this and includes functionalities like rotation. On the other hand, Julian `MultivariateStats` seems to be a more traditional machine learning package where the core practices and the focus are different. In this regard, it is more similar to R functions like `prcomp` and `princomp`.

This may all be correct, but in the end psych delivers a well interpretable output and that should be the essence from the user’s point of view.

The core of my solution (for learning purposes) is this (PCA_Data contains the observations, PCA_HK the estimated number of main components):

``````PCA_Modell = R"principal(\$PCA_Daten, \$PCA_HK)"
PCA_scores =  R"principal(\$PCA_Daten, \$PCA_HK)\$scores"
show(PCA_Modell)
``````

As you can see, I use R over RCall. And that is the output:

``````RObject{VecSxp}
Principal Components Analysis
Call: principal(r = `#JL`\$PCA_Daten, nfactors = `#JL`\$PCA_HK)
RC1   RC2   RC3   RC4   h2   u2 com
1   0.91 -0.13  0.07  0.02 0.85 0.15 1.1
2  -0.34 -0.02 -0.72 -0.13 0.64 0.36 1.5
3   0.75  0.04  0.44  0.23 0.81 0.19 1.9
4   0.35  0.46 -0.03 -0.25 0.39 0.61 2.5
5   0.13  0.00 -0.27  0.80 0.73 0.27 1.3
6  -0.14  0.88  0.03  0.07 0.80 0.20 1.1
7  -0.02  0.88 -0.11  0.09 0.79 0.21 1.1
8   0.78  0.10 -0.44  0.01 0.80 0.20 1.6
9  -0.73  0.01 -0.01 -0.25 0.60 0.40 1.2
10  0.15  0.06  0.27  0.76 0.68 0.32 1.4
11 -0.23 -0.11  0.78 -0.13 0.69 0.31 1.3

RC1  RC2  RC3  RC4
Proportion Var        0.26 0.16 0.15 0.13
Cumulative Var        0.26 0.42 0.58 0.71
Proportion Explained  0.37 0.23 0.21 0.19
Cumulative Proportion 0.37 0.60 0.81 1.00

Mean item complexity =  1.4
Test of the hypothesis that 4 components are sufficient.

The root mean square of the residuals (RMSR) is  0.09
with the empirical chi square  1473.01  with prob <  3.2e-303

Fit based upon off diagonal values = 0.89
``````

The output is easy to read and contains some information on the quality of model estimation. This is basic information for the data analyst.

The value of the latent variables (scores) is also supplied simply (well, Julia (or the function of the package) also supplies this). And by the way, the scree-graphic is delivered as well.

The effort I have to make as a user is quite small. Speaks from my point of view for the R-function. I still see potential for the MultivariateStats package.

Thank you for your support and greetings,
Günter

1 Like

Frankly, I think the outputs from R functions like this are way too verbose - most of the time I want a small fraction of what displayed and have to go hunting. Or I want to pass some data to a different function and the r package thinks the display is the endpoint without giving me easy access to the data. If this kind of display is useful for a Data analyst, maybe make a DataAnalystUtils package that generates nice displays for various models. I’m sure you’re not alone in wanting this, but I wouldn’t want it to be part of the core package.

Edit:

Also would be great to include some plot recipes like this in such a utility package.

4 Likes

I was looking for this exact feature two days ago. I agree that PCA interpretation should get a lot of attention in terms of plots and summary text somewhere.

3 Likes

None of this is very hard to implement though as was pointed out in this thread - overloading `Base.show` and implementing a `Plots.jl` recipe would do the trick; if you don’t feel comfortable doing this yourself maybe try opening an issue on the `Multivariatestats` GitHub, the maintainers might well be up for feature requests!

Interesting thoughts. I can understand that some R functions overwhelm the user with a lot of information. I consider the psych package an exception and it is therefore very popular. However, “hunting” for the right information is often very well supported by the \$-operator. Here is an example: The loadings are explicitly output via Model\$loadings. Usually very simple.

As far as the development of a package like the DataAnalystUtils is concerned, I am not yet experienced enough. For this I need much more experience. I think the idea is good and maybe you can give me a hint to keep the learning curve low?

I don’t think this is true! Making a package is a great way to learn.

You could check out my MicrobiomePlots.jl for an example of how to make a plotting recipe (I made one for a PCoA scatter here) and others have mentioned how you could make `show` methods. Technically that would be type piracy I think, but you could just make new `displaymodel` functions or something.

There are lots of other examples too. Just copy someone and get started

2 Likes

Oh dear, that’s what I get for it! Thanks and I’ll have a look! But I know, sooner or later you can’t get around the package development…

1 Like

Would StatsPlots be a more appropriate place for the plotting features discussed here?

Note: I ask because I need to make a really great PCA plot in the next couple days so I may as well contribute my work somewhere

1 Like

Probably, yes. I mean, IMO the best place for it would really be in the package that creates the type, but there have been several discussions and this is not the generally accepted view. Or, at least it’s not the view of the devs of several of the stats packages who have final say on the matter.

I think a PR adding a more detailed output for PCA in MultivariateStats would be well received (unless it’s really verbose). We already print a lot of information (similar to R) in GLM for example.

StatsPlots sounds like the appropriate place where to define plot recipes for PCA.

2 Likes

Perhaps a nice compromise would be putting various options for details in show with low verbosity as the default (eg. `show(..., verbose=false)`). Personally, I have historically used the psych package and really liked the Scree plot and read out. I don’t think either of them were especially unique to psychology.

I think if there’s any additional manipulation to the data (e.g. unique projections of components) then a separate package would make sense, but creating additional packages for displaying information that’s already there seems odd to me.