Dimensionality Reduction Packages in Julia

I don’t know of such a package–what exactly do you mean by “perform the dimension reduction automatically?” MultivariateStats.jl already allows you to set a maximum number of PCs and/or a proportion of variance to explain when fitting a PCA.

Running a regression on all subsets of variables requires a few lines of code, but is also pretty straightforward (the following is based on this discussion):

using DataFrames, StatsBase, GLM, Combinatorics
df = DataFrame(x1 = randn(10), x2 = randn(10), x3 = randn(10), x4 = randn(10))
df[!, :y] = df.x1 .+ 2df.x2 .+ -0.5df.x3 .+ 1.5df.x4 .+ randn.()

terms = term.(names(df))
term_combis = combinations(terms[1:4])
formulas = [terms[end] ~ sum(c) for c in term_combis]                                               
models = [lm(f, df) for f in formulas]
aics = aic.(models)
comparison = DataFrame(formula = formulas, nterms = length.(term_combis), aic = aics)

15×3 DataFrame
 Row │ formula                nterms  aic
     │ FormulaT…              Int64   Float64
─────┼────────────────────────────────────────
   1 │ y ~ x1                      1  46.4727
   2 │ y ~ x2                      1  45.5619
   3 │ y ~ x3                      1  49.3855
   4 │ y ~ x4                      1  47.7685
   5 │ y ~ x1 + x2                 2  44.7786
   6 │ y ~ x1 + x3                 2  48.4727
   7 │ y ~ x1 + x4                 2  47.3097
   8 │ y ~ x2 + x3                 2  47.1286
   9 │ y ~ x2 + x4                 2  36.4842
  10 │ y ~ x3 + x4                 2  49.6269
  11 │ y ~ x1 + x2 + x3            3  46.1428
  12 │ y ~ x1 + x2 + x4            3  36.5456
  13 │ y ~ x1 + x3 + x4            3  49.2632
  14 │ y ~ x2 + x3 + x4            3  37.4955
  15 │ y ~ x1 + x2 + x3 + x4       4  37.2073

“Best” in a situation like this could mean a number of different things, so you will probably have to define what your criteria are. In this case, models 9, 14, and 15 are all pretty close together in terms of AIC, so I wouldn’t feel comfortable declaring one “best” without some more work (cross-validation, model averaging…) depending on what you’re using the models for.

5 Likes