Dimensionality Reduction Packages in Julia

ElOceanografo · August 26, 2021, 4:58pm

I don’t know of such a package–what exactly do you mean by “perform the dimension reduction automatically?” MultivariateStats.jl already allows you to set a maximum number of PCs and/or a proportion of variance to explain when fitting a PCA.

Running a regression on all subsets of variables requires a few lines of code, but is also pretty straightforward (the following is based on this discussion):

using DataFrames, StatsBase, GLM, Combinatorics
df = DataFrame(x1 = randn(10), x2 = randn(10), x3 = randn(10), x4 = randn(10))
df[!, :y] = df.x1 .+ 2df.x2 .+ -0.5df.x3 .+ 1.5df.x4 .+ randn.()

terms = term.(names(df))
term_combis = combinations(terms[1:4])
formulas = [terms[end] ~ sum(c) for c in term_combis]                                               
models = [lm(f, df) for f in formulas]
aics = aic.(models)
comparison = DataFrame(formula = formulas, nterms = length.(term_combis), aic = aics)

15×3 DataFrame
 Row │ formula                nterms  aic
     │ FormulaT…              Int64   Float64
─────┼────────────────────────────────────────
   1 │ y ~ x1                      1  46.4727
   2 │ y ~ x2                      1  45.5619
   3 │ y ~ x3                      1  49.3855
   4 │ y ~ x4                      1  47.7685
   5 │ y ~ x1 + x2                 2  44.7786
   6 │ y ~ x1 + x3                 2  48.4727
   7 │ y ~ x1 + x4                 2  47.3097
   8 │ y ~ x2 + x3                 2  47.1286
   9 │ y ~ x2 + x4                 2  36.4842
  10 │ y ~ x3 + x4                 2  49.6269
  11 │ y ~ x1 + x2 + x3            3  46.1428
  12 │ y ~ x1 + x2 + x4            3  36.5456
  13 │ y ~ x1 + x3 + x4            3  49.2632
  14 │ y ~ x2 + x3 + x4            3  37.4955
  15 │ y ~ x1 + x2 + x3 + x4       4  37.2073

“Best” in a situation like this could mean a number of different things, so you will probably have to define what your criteria are. In this case, models 9, 14, and 15 are all pretty close together in terms of AIC, so I wouldn’t feel comfortable declaring one “best” without some more work (cross-validation, model averaging…) depending on what you’re using the models for.

Topic		Replies	Views
PCA Output? Statistics	25	6251	February 17, 2022
Problem realization of PCA Optimization (Mathematical) question	2	554	February 4, 2022
ANN: RegressionTables.jl produces publication-quality regression tables Data announcement	41	7926	June 19, 2020
Custom XGBoost Loss function w/ Zygote. Julia Computing blog post Machine Learning zygote , kaggle	36	4926	April 29, 2020
Efficient way of doing linear regression Performance regression	44	20582	February 7, 2022

Dimensionality Reduction Packages in Julia

Related topics