PCA example from documentation -- not running

question
package

#1

Hello,

I am trying to get dimensionality reduction methods running on a sample dataset and I am constantly running into dependency issues.
I use Julia 0.6.2.

For instance:
http://multivariatestatsjl.readthedocs.io/en/latest/pca.html
does not run.
I tried using a DataFrames.DataFrame instead of a DataArray, but then the scatter plot is not produced.

The same goes for the Kernel PCA example:
http://multivariatestatsjl.readthedocs.io/en/latest/kpca.html
It does not work for me either.

Could anyone help out?

Thanks.


#2

I’m not familiar with the inner workings of MultivariateStats.jl, but as a general comment, it’s hard to say what your problem might be from the information given. Could you please post a minimal example of what you’re trying to run and the error you get?


#3

Thanks for the prompt response.
Sure. After adding the package MultivariateStats and creating Xtr, I run

M = fit(KernelPCA, Xtr; maxoutdim=100, inverse=true)

I get
ERROR: UndefVarError: KernelPCA not defined

PCA seems to be working. However, producing the scatter plot with

p = scatter(setosa[1,:],setosa[2,:],setosa[3,:],marker=:circle,linewidth=0)

does not work on DataFrames (I used a DataFrame instead of a DataArray, as the latter didn’t run or is outdated).
If someone has the right method at hand for DataFrames, that would be great.


#4

KernelPCA was added to MultivariateStats pretty recently and the developers haven’t tagged a new release since it was added. If you installed MultivariateStats by running

Pkg.add("MultivariateStats")

then you’ll be using the latest release of the package, which does not include KernelPCA. When you go to https://github.com/JuliaStats/MultivariateStats.jl and click on the documentation link at the bottom of the README, it brings you to the documentation as it exists on the master branch of the repository, not the documentation for the latest release. You can run Pkg.checkout("MultivariateStats") to try out the latest features. If you want to go back to the release version, just run Pkg.free("MultivariateStats"). Check out the package manager documentation for more information. The following worked for me:

julia> Pkg.checkout("MultivariateStats")
INFO: Checking out MultivariateStats master...
INFO: Pulling MultivariateStats latest master...                                                  
INFO: No packages to install, update or remove                                                    

julia> Xtr = rand(100,10);

julia> using MultivariateStats
INFO: Recompiling stale cache file /home/patrick/.julia/lib/v0.6/MultivariateStats.ji for module MultivariateStats.

julia> M = fit(KernelPCA, Xtr; maxoutdim=100, inverse=true);

I don’t know about your second question unfortunately. To help out someone who might be able to answer it, are you using Plots.jl? In your example is setosa a DataFrame? Can you make a scatter plot if you just pass three plain vectors to scatter?


#5

Thanks for your help.
Regarding the first question: Even if I pull it from the master branch by using checkout I get the same error message…
Regarding my second question:
I am using Plots.jl, and scatter now works for me with plain arrays only, which is fine. So that problem is solved.


#6

Had you already loaded the MultivariateStats package into your Julia session when you did Pkg.checkout? If so, you should just need to restart Julia and try again. If that doesn’t solve your problem, I don’t know what’s going on.
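
One way to check whether the running session sees a version that includes KernelPCA is sketched below. It relies only on Base's two-argument isdefined; the variable name has_kpca is just illustrative, and the result depends on which version of the package is actually loaded.

```julia
using MultivariateStats

# isdefined(module, symbol) reports whether the loaded module has a binding
# with that name. It is false if the loaded code predates KernelPCA and true
# once a version that defines it has been loaded.
has_kpca = isdefined(MultivariateStats, :KernelPCA)
println("KernelPCA available: ", has_kpca)
```

If this prints false in a fresh session after the checkout, the session is still loading the older release.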

Patrick


#7

Based on the MultivariateStats documentation, it looks like PCA is still designed to be performed on an array with each observation as a column. This is far from the layout that is intuitive in economics, where each observation is a row and you want to combine multiple variables (columns) into a few key components.

Is full integration with DataFrames planned for PCA?

Perhaps you can provide a code snippet that shows how you might want to do this with the MultivariateStats PCA?
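
As a starting point, here is a minimal sketch of running PCA on data stored with one observation per row, by swapping the dimensions before and after the fit. The data and the names X_rows, X_cols and scores are made up for illustration. A DataFrame would first have to be converted to a plain numeric matrix (for example with Matrix(df) on recent DataFrames releases, which is an assumption about the installed version).

```julia
using MultivariateStats

# Hypothetical data: 100 observations (rows) of 5 variables (columns),
# i.e. the row-per-observation layout described above.
X_rows = rand(100, 5)

# MultivariateStats expects variables in rows and observations in columns,
# so swap the two dimensions before fitting.
X_cols = permutedims(X_rows, [2, 1])   # 5 x 100

# Fit a PCA model that keeps at most 2 principal components.
M = fit(PCA, X_cols; maxoutdim=2)

# Project the observations onto the components.
# Y has one row per component and one column per observation.
# Assumption: releases from the era of this thread call this `transform`;
# newer releases name it `predict`.
Y = transform(M, X_cols)

# Back to the row-per-observation layout: one row per observation,
# one column per retained component.
scores = permutedims(Y, [2, 1])
println(size(scores))
```

The columns of scores can then be attached back to the original table as new variables, since the row order is unchanged.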