Dimensionality Reduction Packages in Julia

YummyPampers2 · August 25, 2021, 7:17pm

Hello Everyone:

Are there any practical examples of PCA implementation that work with Julia?
YES – I am familiar and have attempted to implement the instructions listed HERE.

My goal however is to use a package that can perform the dimension
reduction automatically or provide a table of optimal features to use for
future analysis.

Are there any packages out there that can display the best subset regression for
X-values (independent) v. Y-value (dependent/outcome variable)?

ElOceanografo · August 26, 2021, 4:58pm

I don’t know of such a package–what exactly do you mean by “perform the dimension reduction automatically?” MultivariateStats.jl already allows you to set a maximum number of PCs and/or a proportion of variance to explain when fitting a PCA.

Running a regression on all subsets of variables requires a few lines of code, but is also pretty straightforward (the following is based on this discussion):

using DataFrames, StatsBase, GLM, Combinatorics
df = DataFrame(x1 = randn(10), x2 = randn(10), x3 = randn(10), x4 = randn(10))
df[!, :y] = df.x1 .+ 2df.x2 .+ -0.5df.x3 .+ 1.5df.x4 .+ randn.()

terms = term.(names(df))
term_combis = combinations(terms[1:4])
formulas = [terms[end] ~ sum(c) for c in term_combis]                                               
models = [lm(f, df) for f in formulas]
aics = aic.(models)
comparison = DataFrame(formula = formulas, nterms = length.(term_combis), aic = aics)

15×3 DataFrame
 Row │ formula                nterms  aic
     │ FormulaT…              Int64   Float64
─────┼────────────────────────────────────────
   1 │ y ~ x1                      1  46.4727
   2 │ y ~ x2                      1  45.5619
   3 │ y ~ x3                      1  49.3855
   4 │ y ~ x4                      1  47.7685
   5 │ y ~ x1 + x2                 2  44.7786
   6 │ y ~ x1 + x3                 2  48.4727
   7 │ y ~ x1 + x4                 2  47.3097
   8 │ y ~ x2 + x3                 2  47.1286
   9 │ y ~ x2 + x4                 2  36.4842
  10 │ y ~ x3 + x4                 2  49.6269
  11 │ y ~ x1 + x2 + x3            3  46.1428
  12 │ y ~ x1 + x2 + x4            3  36.5456
  13 │ y ~ x1 + x3 + x4            3  49.2632
  14 │ y ~ x2 + x3 + x4            3  37.4955
  15 │ y ~ x1 + x2 + x3 + x4       4  37.2073

“Best” in a situation like this could mean a number of different things, so you will probably have to define what your criteria are. In this case, models 9, 14, and 15 are all pretty close together in terms of AIC, so I wouldn’t feel comfortable declaring one “best” without some more work (cross-validation, model averaging…) depending on what you’re using the models for.

ElOceanografo · August 27, 2021, 2:22pm

I’m not familiar with GlobalSearchRegression.jl, but it looks like it’s doing basically the same thing. I’m not exactly sure what you mean by “which labels to use”…do you mean just what to use for axis/color labels? If you’re using PCA to reduce a large set of correlated explanatory variables, just call them “PC 1, PC 2” etc. in order of their variance explained. The PCA procedure in MultivariateStats.jl will always return them in order.

YummyPampers2 · August 27, 2021, 8:21pm

Regarding the label question. Take for example –
You have a data frame sized as (500,14)

First: Split half of the data to Training Set

Xtr = convert(Matrix, DF[ :, 1:13])'
Xtr_labels = convert(Vector, DF[:, 14])

Split half the data to test set

Xte = convert(Matrix, DF[2:2:end, 1:13])'
Xte_labels = convert(Vector, DF[2:end, 14])

Train a PCA model allowing 3 dimensions

M = fit(PCA, Xtr, maxoutdim = 3)
#Transform 
Yte = transform(M, Xte)

Approximately reconstruct test observations

Xr = reconstruct(M, Yte)

From here if you wanted to assign labels of
interest (that might occupy random column position
from Yte (i.e., Yte[:, 2], Yte[:, 5], Yte[:, 8]) using
something like:

	Cookies = Yte[:,Xte_labels.=="Cookies"]
        Cereals = Yte[:,Xte_labels.=="Cereals"]
	Pandas = Yte[:,Xte_labels.=="Pandas"]

How would you avoid an error than might
occur that reads:

BoundsError: attempt to access 3×253 Matrix{Float64} at index [1:3, Bool[0,0,...]]
# With stacktrace reading
throw_boundserror(::Matrix{Float64}, ::Tuple{Base.Slice{Base.OneTo{Int64}}, Base.LogicalIndex{Int64, BitVector}})@abstractarray.jl:651
checkbounds@abstractarray.jl:616[inlined]
_getindex(::IndexLinear, ::Matrix{Float64}, ::Base.Slice{Base.OneTo{Int64}}, ::Base.LogicalIndex{Int64, BitVector})@multidimensional.jl:831
getindex@abstractarray.jl:1170[inlined]

AND directly related to your suggestion to assign
“PC1, PC2…” label, how would you avoid a blank
plot that might result from the instructions below:

p = stp.scatter(Cookies[1,:], Cookies[2,:], Cookies[3,:] ,marker=:circle,linewidth=0)
	stp.scatter!(Cereals[1,:], Cereals[2,:], Cereals[3,:] ,marker=:circle,linewidth=0)
	stp.scatter!(Pandas[1,:], Pandas[2,:], Pandas[3,:] ,marker=:circle,linewidth=0)
		plot!(p,
		      xlabel="PC1", 
		      ylabel="PC2", 
		      zlabel="PC3", 
	              background= :grey,  
                      color= :blue)

YummyPampers2 · August 27, 2021, 8:39pm

For your presentation here, how
did you decide on the coefficients
from the line:

df[!, :y] = df.x1 .+ 2df.x2 .+ -0.5df.x3 .+ 1.5df.x4 .+ randn.()

as (1, 2, -0.5, 1.5 …)

Thank you,

YummyPampers2 · August 30, 2021, 11:44pm

@ElOceanografo

And

models = [lm(f, df) for f in formulas]

Is taking quite a long time to execute. Did
you have similar issue? Might you know
some steps to address this problem?

ElOceanografo · September 1, 2021, 3:17pm

I’m still not totally clear what you’re trying to accomplish, but it looks like you might have the rows and columns mixed up, which is what is causing the out-of-bounds error?

In your plotting question, what is stp? This would be easier for me to help with if you could supply a MWE…

ElOceanografo · September 1, 2021, 3:21pm

Are you running it on my little example, or on a larger dataset of your own? “All combinations” of a set of variables grows, well, combinatorically, so you it’s possible the length of the list of models could get very long. You can limit the number of terms to include with the second argument to combinations.

ElOceanografo · September 1, 2021, 3:21pm

I just made these up for the sake of the example.

YummyPampers2 · September 1, 2021, 7:29pm

stp is the alias I used for StatsPlots.jl.

YummyPampers2 · September 1, 2021, 7:35pm

I am running your example on a dataframe sized (506,14) with
all float fields and the response variable in column 14.

You are saying I could use:

combo1 = combinations(terms[1:4])
combo2 = combinations(terms[5:8])
combo3 = ...

Then run:

comparison = DataFrame( ... length.(combo1), ..)
comparison2 = DataFrame(... length.(combo2), ..)
...

YummyPampers2 · September 1, 2021, 7:40pm

Might you be able to help me trouble shoot the row and column issue?
As I said, the DataFrame size is (506,14), but I can learn from a MWE.

What I am attempting to do was a PCA analysis. However, before your
previous reply, was not aware that the components are always analyzed
from left to right in order. Because I did not know this, I was looking for
the subset regression analysis packages to provide a more informative
report of how the explanatory variables interact with outcome variables.

sylvaticus · September 1, 2021, 8:04pm

Just to add to the list, the pca function in the BetaML package allows to perform pca by choosing either the number of dimensions in output or the minimum share of the original variance to retrieve.

YummyPampers2 · September 2, 2021, 5:26am

Thank you for sharing.

Do you know why the parameter X is explained as (presented in the order of descending variance for columnar values) yet the MWE is not displayed as such? Is there a different way to interpret their MWE here:

sylvaticus · September 2, 2021, 5:59am

Thanks . I am going to have a look in a few hours… (first day of school here in France)…

sylvaticus · September 2, 2021, 12:49pm

You are correct. There was a bug with the documentation explaining that the order was from the most to the less explained variance, while the code was doing the opposite.

I did prefer changing the code, so there should be soon (a few minutes) a new version v0.5.4 with the code changed to agree with the documentation, e.g.:

julia> using BetaML
julia> X = [1 8; 4.5 5.5; 9.5 0.5]
3×2 Matrix{Float64}:
 1.0  8.0
 4.5  5.5
 9.5  0.5
julia> out = pca(X;K=2);
julia> out.X
3×2 Matrix{Float64}:
 -4.58465   6.63182
 -0.308999  7.09961
  6.75092   6.70262
julia> out.explVarByDim
2-element Vector{Float64}:
 0.9980637093577441
 1.0
julia> out.P
2×2 Matrix{Float64}:
  0.745691  0.666292
 -0.666292  0.745691

ElOceanografo · September 2, 2021, 1:54pm

Ok, that may explain it. The number of ways to choose k unique terms from a set of n is given by the binomial coefficient (the formula at the top of this article: Combination - Wikipedia). In Julia, that’s number is given by the binomial(n, k) function. Because you’re doing all possible subsets of 14 variables (i.e., with k ranging from 1 to 14), the total number of models you’re trying to fit is

julia> sum(binomial(14, k) for k in 1:14)
16383

so I’d expect it to take a little while.

If you look at the documentation for combinations, you can limit it to only forming combinations of a certain length or less (putting an upper limit on k, in the notation above). For instance, with 14 terms, doing

combo = combinations(terms, 4)

would limit the resulting combinations to 4 or fewer terms, and the total number of models you’d have to fit would go from 16,383 to 1,470.

However, I think there may be a more fundamental misunderstanding here of PCA and dimensionality reduction. If you have a large set of correlated variables, a PCA will find a linear function that transforms them to a set of uncorrelated variables that contain the same information in a more parsimonious format. Typically, you can choose a small number of principal components that capture most of the information/variability in the full dataset. People do PCA before regression modeling because linear regressions can be hard to interpret when they have a lot of correlated explanatory variables. But critically, the transformed variables that come out of a PCA are already independent and uncorrelated, so you don’t need to do any variable selection on them when building your regression. The first k PCs are already the k best to use in your model. Here’s the gist of how you’d use PCA for dimensionality reduction prior to regression:

using MultivariateStats, GLM
X = randn(512, 14)  # Simulate 14 explanatory variables
b = randn(14) # Simulated regression coefficients
y = X * b .+ randn.() # Simulated variable to predict
# Find a reduced set of uncorrelated variables using PCA
m = fit(PCA, X', maxoutdim=4)
T = MultivariateStats.transform(m, X') 
# Put them in a DataFrame and do the regression
df = DataFrame(T', [:T1, :T2, :T3, :T4])
lm(@formula(y ~ T1 + T2 + T3 + T4), df)

I’m still not totally clear on what task you are trying to do. What are your data, what are you trying to predict, what are cookies, cereals, and pandas supposed to represent…a self-contained MWE that we can copy and paste into a REPL would be really helpful.

YummyPampers2 · September 2, 2021, 3:20pm

Good Day Sam – I really appreciate this explanation. Give me a little time and I will share a MWE.

YummyPampers2 · September 2, 2021, 3:22pm

Thank you very much Antonello. School starting here too – I think the overlords all decided that school starts on Sept. 1 or 2nd.

I will implement this and provide some additionally feedback shortly.

YummyPampers2 · September 5, 2021, 9:57pm

Hi Sam:

I took table with 14 attributes. Columns
1:13 are the explanatory variables. Column
14 is the outcome variable. I want to select
a best subset of the variables in the given
table. In other words: perform all subset
regressions on a set of potential covariates,
in order to select relevant features.

Yes, there are different criteria to define ‘best’.
In this case, let us use, :r2adj, :aic, & :cp to
cross-check.

I prefer your method or GlobalSearchRegression.jl
over P.C.A. because the results maintain the labels
from the original dataframe and do not require what
seems like an a priori arrangement of the attributes
in ranking order of variance (as does P.C.A.).

Minimum Workable Example:

Let us use: ‘Boston Housing Dataset’ HERE

Using your example above (after rearranging
the columns so :medv is the last column):

X = Matrix(HouseDF[:, 1:14])

then,

b = Matrix(HouseDF[:,14:end])

running the following:

y = X * b .+ randn.()

yields the following error:

y

DimensionMismatch("A has dimensions (506,14) but B has dimensions (506,1)")

    gemm_wrapper!(::Matrix{Float64}, ::Char, ::Char, ::Matrix{Float64}, ::Matrix{Float64}, ::LinearAlgebra.MulAddMul{true, true, Bool, Bool})@matmul.jl:643
    mul!@matmul.jl:169[inlined]
    mul!@matmul.jl:275[inlined]
    *@matmul.jl:160[inlined]
    top-level scope@Local: 1[inlined]

Topic		Replies	Views
PCA Output? Statistics	25	6257	February 17, 2022
Problem realization of PCA Optimization (Mathematical) question	2	555	February 4, 2022
ANN: RegressionTables.jl produces publication-quality regression tables Data announcement	41	7927	June 19, 2020
Custom XGBoost Loss function w/ Zygote. Julia Computing blog post Machine Learning zygote , kaggle	36	4926	April 29, 2020
Efficient way of doing linear regression Performance regression	44	20606	February 7, 2022

Dimensionality Reduction Packages in Julia

Related topics