The methods offered by PCA seem a little rudimentary to me. I’m just trying to explore the advantages of Julia over R and get in R this output:
> PCA_Modell <- principal(Rotwein_o_Q, 4)
> print(PCA_Modell, cut = 0.5, sort = TRUE, digits = 2)
Principal Components Analysis
Call: principal(r = Rotwein_o_Q, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
item RC1 RC2 RC3 RC4 h2 u2 com
fixed.acidity 1 0.91 0.85 0.15 1.1
density 8 0.78 0.80 0.20 1.6
citric.acid 3 0.75 0.81 0.19 1.9
pH 9 -0.73 0.60 0.40 1.2
free.sulfur.dioxide 6 0.88 0.80 0.20 1.1
total.sulfur.dioxide 7 0.88 0.79 0.21 1.1
residual.sugar 4 0.39 0.61 2.5
alcohol 11 0.78 0.69 0.31 1.3
volatile.acidity 2 -0.72 0.64 0.36 1.5
chlorides 5 0.80 0.73 0.27 1.3
sulphates 10 0.76 0.68 0.32 1.4
RC1 RC2 RC3 RC4
SS loadings 2.87 1.80 1.67 1.44
Proportion Var 0.26 0.16 0.15 0.13
Cumulative Var 0.26 0.42 0.58 0.71
Proportion Explained 0.37 0.23 0.21 0.19
Cumulative Proportion 0.37 0.60 0.81 1.00
Mean item complexity = 1.4
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.09
with the empirical chi square 1473 with prob < 3.2e-303
Does Julia offer a similar edition? Because the loadings, SS loadings and Cumulative Var (and more information about the model) are important for model evaluation.
Thank you for your support and
best regards,
Günter
according to the docs which seem to be related. I’m assuming you’re interested in this for some sort of model selection rather than in the print output itself - if this is actually about the printed output then you could look into defining a Base.show() method for the PCA type.
I know the mentioned PCA methods, but they are far away from really speaking of a PCA.
As a data analyst, I need to see more or less at a glance what the loadings of each variable on the main components are like. Just as the R function principal does. Also, I have no idea (I can’t see anything in the docu) how to pass the number of possible principal components and the rotation as arguments to the PCA function.
At the moment I don’t see any comparison to the R-function, but hope that I’m wrong…!
It looks like R function you are using comes from a package called psych and it does a combination of PCA and factor analysis. On the other hand fit(PCA, x,...) of MultivariateStats.jl does just PCA. Indeed if you check out prcomp or princomp, which are the main PCA implementations in R, they are very similar to Julia function in terms of what they output.
Output of fit(PCA, x, ...) may not look similar to that of principal in R at first glance, but they contain more or less the same information. Consider this example in R:
> pc <- principal(cov(iris[1:4]),4,rotate="none")
> pc
Principal Components Analysis
Call: principal(r = cov(iris[1:4]), nfactors = 4, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 PC3 PC4 h2 u2 com
Sepal.Length 0.89 0.36 -0.28 -0.04 1 4.4e-16 1.5
Sepal.Width -0.46 0.88 0.09 0.02 1 -1.1e-15 1.5
Petal.Length 0.99 0.02 0.05 0.12 1 -1.0e-15 1.0
Petal.Width 0.96 0.06 0.24 -0.08 1 -1.1e-15 1.1
PC1 PC2 PC3 PC4
SS loadings 2.92 0.91 0.15 0.02
Proportion Var 0.73 0.23 0.04 0.01
Cumulative Var 0.73 0.96 0.99 1.00
Proportion Explained 0.73 0.23 0.04 0.01
Cumulative Proportion 0.73 0.96 0.99 1.00
Notice that “SS loadings” from R is same as principalvars(p) in Juia, and “Loadings” from R is just a scaled version of projection(p) (multiply each column of projection with the corresponding principalvars^0.5).
I quickly compared the issue with R & Iris and they agree. The output still needs a little “polishing” but for the beginning that’s ok. As I noticed at the beginning, I know that Julia offers everything about PCA, but the preparation and postprocessing is not very user-friendly.
As you seem to be concerned about “polished” output and having output available “at a glance” you might want to look into overloading Base.show() for the PCA type to get the ouptut you want, e.g.:
julia> using MultivariateStats
julia> model = fit(PCA, rand(1000,10))
julia> Base.show(io::IO, x::PCA{Float64}) = begin
println("Variance of principle components: ",x.prinvars)
println("Total observation variance: ",x.tvar)
println("Total residual variance :",tresidualvar(model))
end
julia> model
Variance of principle components: [9.92461, 9.61066, 9.20797, 8.44422, 8.11305, 7.85349, 7.58558, 7.07091, 6.88697]
Total observation variance: 74.69745337859435
Total residual variance :0.0
(Note that I haven’t thought at all about whether that’s useful information to display, it’s just to give you a general idea of what you could do - lots more info in the I/O and Network docs)
I think it is more related to the focus of the packages. R package psych is designed mainly around the practices in psychological research, where interpreting the factors and loadings is common practice. Therefore it presents its output in a certain way to facilitate this and includes functionalities like rotation. On the other hand, Julian MultivariateStats seems to be a more traditional machine learning package where the core practices and the focus are different. In this regard, it is more similar to R functions like prcomp and princomp.
As you can see, I use R over RCall. And that is the output:
RObject{VecSxp}
Principal Components Analysis
Call: principal(r = `#JL`$PCA_Daten, nfactors = `#JL`$PCA_HK)
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 RC4 h2 u2 com
1 0.91 -0.13 0.07 0.02 0.85 0.15 1.1
2 -0.34 -0.02 -0.72 -0.13 0.64 0.36 1.5
3 0.75 0.04 0.44 0.23 0.81 0.19 1.9
4 0.35 0.46 -0.03 -0.25 0.39 0.61 2.5
5 0.13 0.00 -0.27 0.80 0.73 0.27 1.3
6 -0.14 0.88 0.03 0.07 0.80 0.20 1.1
7 -0.02 0.88 -0.11 0.09 0.79 0.21 1.1
8 0.78 0.10 -0.44 0.01 0.80 0.20 1.6
9 -0.73 0.01 -0.01 -0.25 0.60 0.40 1.2
10 0.15 0.06 0.27 0.76 0.68 0.32 1.4
11 -0.23 -0.11 0.78 -0.13 0.69 0.31 1.3
RC1 RC2 RC3 RC4
SS loadings 2.87 1.80 1.67 1.44
Proportion Var 0.26 0.16 0.15 0.13
Cumulative Var 0.26 0.42 0.58 0.71
Proportion Explained 0.37 0.23 0.21 0.19
Cumulative Proportion 0.37 0.60 0.81 1.00
Mean item complexity = 1.4
Test of the hypothesis that 4 components are sufficient.
The root mean square of the residuals (RMSR) is 0.09
with the empirical chi square 1473.01 with prob < 3.2e-303
Fit based upon off diagonal values = 0.89
The output is easy to read and contains some information on the quality of model estimation. This is basic information for the data analyst.
The value of the latent variables (scores) is also supplied simply (well, Julia (or the function of the package) also supplies this). And by the way, the scree-graphic is delivered as well.
The effort I have to make as a user is quite small. Speaks from my point of view for the R-function. I still see potential for the MultivariateStats package.
Frankly, I think the outputs from R functions like this are way too verbose - most of the time I want a small fraction of what displayed and have to go hunting. Or I want to pass some data to a different function and the r package thinks the display is the endpoint without giving me easy access to the data. If this kind of display is useful for a Data analyst, maybe make a DataAnalystUtils package that generates nice displays for various models. I’m sure you’re not alone in wanting this, but I wouldn’t want it to be part of the core package.
Edit:
Also would be great to include some plot recipes like this in such a utility package.
I was looking for this exact feature two days ago. I agree that PCA interpretation should get a lot of attention in terms of plots and summary text somewhere.
None of this is very hard to implement though as was pointed out in this thread - overloading Base.show and implementing a Plots.jl recipe would do the trick; if you don’t feel comfortable doing this yourself maybe try opening an issue on the Multivariatestats GitHub, the maintainers might well be up for feature requests!
Interesting thoughts. I can understand that some R functions overwhelm the user with a lot of information. I consider the psych package an exception and it is therefore very popular. However, “hunting” for the right information is often very well supported by the $-operator. Here is an example: The loadings are explicitly output via Model$loadings. Usually very simple.
As far as the development of a package like the DataAnalystUtils is concerned, I am not yet experienced enough. For this I need much more experience. I think the idea is good and maybe you can give me a hint to keep the learning curve low?
I don’t think this is true! Making a package is a great way to learn.
You could check out my MicrobiomePlots.jl for an example of how to make a plotting recipe (I made one for a PCoA scatter here) and others have mentioned how you could make show methods. Technically that would be type piracy I think, but you could just make new displaymodel functions or something.
There are lots of other examples too. Just copy someone and get started
Probably, yes. I mean, IMO the best place for it would really be in the package that creates the type, but there have been several discussions and this is not the generally accepted view. Or, at least it’s not the view of the devs of several of the stats packages who have final say on the matter.
I think a PR adding a more detailed output for PCA in MultivariateStats would be well received (unless it’s really verbose). We already print a lot of information (similar to R) in GLM for example.
StatsPlots sounds like the appropriate place where to define plot recipes for PCA.
Perhaps a nice compromise would be putting various options for details in show with low verbosity as the default (eg. show(..., verbose=false)). Personally, I have historically used the psych package and really liked the Scree plot and read out. I don’t think either of them were especially unique to psychology.
I think if there’s any additional manipulation to the data (e.g. unique projections of components) then a separate package would make sense, but creating additional packages for displaying information that’s already there seems odd to me.