Dimensionality Reduction Packages in Julia

So, a few points and questions…

  • Whether to choose a good subset of variables, or to transform them with a PCA, is a choice that depends on the dataset and the goals of the analysis. I just want to emphasize that PCA is not ranking the original variables, it is finding a new set of synthetic variables (principal components) based on the correlation structure of the original ones. These synthetic variables are by construction uncorrelated, and in order of decreasing variance.

  • That error is pretty self-explanatory. If you are familiar with matrix-vector multiplication, you can figure out why it’s happening. If you aren’t, it may be worth learning a bit more linear algebra before going much deeper into regression modeling or machine learning.

  • In my snippet above, I was simulating an example dataset. b was a simulated vector of regression coefficients, which I used to create a simulated y-variable that was a linear combination of the x-variables (i.e., the assumption behind linear regression). If you already have a dataset, you don’t need to do that.

  • It appears you are including the y-variable (column 14) in your X matrix. Is that what you want?

  • I was able to get the dataset, but to repeat myself, please provide a self-contained MWE that we can copy and paste into a REPL. That means loading all the packages, downloading or generating a usable dataset, and running every step needed to reproduce the error or get us up to the sticking point.
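
To make the first point concrete, here is a minimal sketch of PCA done by hand with the SVD, using only the standard library rather than any particular PCA package. The data are simulated and the names (x1, x2, x3, loadings, scores) are made up for the illustration; it is not tied to your dataset.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)
n = 200

# Simulated data (an assumption for illustration): x2 is correlated with x1, x3 is independent
x1 = randn(n)
x2 = 0.8 .* x1 .+ 0.3 .* randn(n)
x3 = randn(n)
X = hcat(x1, x2, x3)

# Center each column, then take the SVD of the centered matrix
Xc = X .- mean(X, dims = 1)
F = svd(Xc)

# Each column of `loadings` defines one principal component as a weighted
# mix of ALL the original variables; nothing here ranks x1, x2, x3 themselves
loadings = F.V
scores = Xc * loadings              # the new synthetic variables
variances = F.S .^ 2 ./ (n - 1)     # variance of each component

# Covariance of the scores: off-diagonal entries are zero up to rounding error
C = cov(scores)
```

The diagonal of C matches variances, which come out in decreasing order, and the off-diagonal entries are numerically zero. That is what "uncorrelated, in order of decreasing variance" means in practice.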

Modifying my previous example to use the Boston Housing dataset got me the following. I limited the number of terms per formula to 5; if you’ve got time to wait you could remove that limitation and fit all the possible models.

using CSV, HTTP, DataFrames, Combinatorics, GLM, StatsBase

# Download the Boston Housing data and read it into a DataFrame
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
housing = DataFrame(CSV.File(HTTP.get(url).body))

# One term per column; columns 1:13 are the predictors, column 14 is the response
allterms = term.(names(housing))
# Every combination of predictors with at most 5 terms
term_combis = [c for c in combinations(allterms[1:13]) if length(c) <= 5]
formulas = [allterms[14] ~ sum(c) for c in term_combis]

# Fit each model and rank the formulas by AIC (lower is better)
models = [lm(f, housing) for f in formulas]
aics = aic.(models)
comparison = DataFrame(formula = string.(formulas),
    nterms = length.(term_combis),
    aic = aics)
sort!(comparison, :aic)
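
For a rough sense of what the 5-term limit saves, here is a quick count of how many models get fit with and without it. This is just arithmetic with binomial coefficients, assuming 13 candidate predictors as above; the variable names are only for illustration.

```julia
nvars = 13      # candidate predictors (columns 1:13)
maxterms = 5    # the limit used above

# Number of non-empty predictor subsets with at most `maxterms` terms
n_limited = sum(binomial(nvars, k) for k in 1:maxterms)

# Number of non-empty subsets with no limit, i.e. 2^13 - 1
n_all = sum(binomial(nvars, k) for k in 1:nvars)

println("with limit: ", n_limited, ", without: ", n_all)
```

So removing the limit means fitting roughly three and a half times as many models.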

Hello Sam:

Thank you again! Might you be able to point me to some resources that align with the Linear Algebra component of our thread here? Or general resources that are practical and not super esoteric.

Much appreciated,

I don’t have any particular recommendations; you could try Khan Academy or similar tutorials. The important concepts to start with would be vectors, matrices, matrix multiplication, and the way that the columns of a matrix are a “basis” for a multidimensional space.
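
As a small taste of those concepts, here is a sketch with made-up numbers (X, b, and y are just illustrative names, loosely echoing the simulated regression example earlier): multiplying a matrix by a vector gives a weighted sum of the matrix's columns, and the shapes have to agree or Julia throws a DimensionMismatch.

```julia
using LinearAlgebra

X = [1.0 2.0;
     3.0 4.0;
     5.0 6.0]          # 3 rows (observations), 2 columns (variables)
b = [10.0, 1.0]        # one coefficient per column of X

# Matrix-vector product: one number per row of X
y = X * b

# The same result written as a weighted sum of the columns of X
combo = b[1] .* X[:, 1] .+ b[2] .* X[:, 2]

# A vector of the wrong length cannot be multiplied by X:
# X has 2 columns, so the vector needs exactly 2 entries
err = try
    X * [1.0, 2.0, 3.0]
    nothing
catch e
    e
end
```

Here y and combo are the same vector, and err holds the DimensionMismatch raised by the last product. Seeing y as a combination of the columns of X is the "basis" idea in miniature.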


I would recommend Stephen Boyd and Lieven Vandenberghe’s Introduction to Applied Linear Algebra, which is not only nicely written but also legally available in electronic form on the authors’ website at http://vmls-book.stanford.edu/.

I would then follow with Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control by Steve Brunton and Nathan Kutz. Unfortunately, a PDF of the book is not available on the book website http://www.databookuw.com/ (well, you can always buy the book), but a whole lot of videos are linked there. In particular, the material on dimensionality reduction using the SVD (including a careful and accessible intro to PCA) is in Chapter 1: Singular Value Decomposition. Check it out.


Wow, the vmls book even has a Julia companion http://vmls-book.stanford.edu/vmls-julia-companion.pdf


In addition to the PCA method, you might also look at UMAP:

https://github.com/dillondaudert/UMAP.jl


This is a great example of a use of programmatic formula construction!! Cool to see it popping up in the wild :)


Hi Sam,

Sorry for the extended delay. I did not get your message in my email alerts, but I appreciate the suggestion, and the convo from before.