Dimensionality Reduction Packages in Julia

ElOceanografo · September 7, 2021, 8:59pm

So, a few points and questions…

Whether to choose a good subset of variables or to transform them with a PCA depends on the dataset and the goals of the analysis. I just want to emphasize that PCA does not rank the original variables; it finds a new set of synthetic variables (principal components) based on the correlation structure of the original ones. These synthetic variables are uncorrelated by construction and are ordered by decreasing variance.
That error is pretty self-explanatory. If you are familiar with matrix-vector multiplication, you can figure out why it’s happening. If you aren’t, it may be worth learning a bit more linear algebra before going much deeper into regression modeling or machine learning.
In my snippet above, I was simulating an example dataset. b was a simulated vector of regression coefficients, which I used to create a simulated y-variable that was a linear combination of the x-variables (i.e., the assumption behind linear regression). If you already have a dataset, you don’t need to do that.
It appears you are including the y-variable (column 14) in your X matrix. Is that what you want?
I was able to get the dataset, but to repeat myself, please provide a self-contained MWE that we can copy and paste into a REPL. That means loading all the packages, downloading or generating a usable dataset, and reproducing all steps required to reproduce the error or get us up to the sticking point.
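
To make the point about PCA concrete, here is a minimal sketch using only the standard library. The data are simulated and every variable name is made up for illustration. It centers the columns but does not standardize them, so it works from the covariance rather than the correlation matrix; standardizing first is a common alternative.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)                       # fixed seed so the sketch is repeatable

# Hypothetical data: 200 observations of 3 variables, the first two strongly correlated.
n = 200
x1 = randn(n)
x2 = x1 .+ 0.3 .* randn(n)
x3 = randn(n)
X = hcat(x1, x2, x3)

# Center each column, then take the SVD: Xc = U * Diagonal(S) * V'.
Xc = X .- mean(X, dims=1)
F = svd(Xc)

loadings = F.V                        # each column mixes all of the original variables
scores = Xc * loadings                # the synthetic variables (principal components)
variances = F.S .^ 2 ./ (n - 1)       # variance of each component

# The components are uncorrelated, so their covariance matrix is diagonal ...
@show round.(cov(scores), digits=6)
# ... and the variances come out in decreasing order.
@show variances
@show issorted(variances, rev=true)
```

Each column of `loadings` has a nonzero weight on several of the original variables, which is why the components can't be read as a ranking of those variables.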

Modifying my previous example to use the Boston Housing dataset got me the following. I limited the number of terms per formula to 5; if you’ve got time to wait you could remove that limitation and fit all the possible models.

using CSV, HTTP, DataFrames, Combinatorics, GLM, StatsBase

# Download the Boston Housing data; column 14 is the response.
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
housing = DataFrame(CSV.File(HTTP.get(url).body))

# One term per column, then every combination of at most 5 of the 13 predictors.
allterms = term.(names(housing))
term_combis = [c for c in combinations(allterms[1:13]) if length(c) <= 5]
formulas = [allterms[14] ~ sum(c) for c in term_combis]

# Fit each model and rank the formulas by AIC (lower is better).
models = [lm(f, housing) for f in formulas]
aics = aic.(models)
comparison = DataFrame(formula = string.(formulas),
    nterms = length.(term_combis),
    aic = aics)
sort!(comparison, :aic)
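
For a rough idea of what removing the 5-term limit would cost, this small sketch counts the candidate models. It is plain arithmetic on the 13 predictor columns; the variable names are just for illustration.

```julia
# Number of candidate predictors, excluding the response (column 14).
npredictors = 13

# Formulas with 1 to 5 terms, matching the filtered list of combinations.
nlimited = sum(binomial(npredictors, k) for k in 1:5)

# Every non-empty subset of the predictors, i.e. no limit on the number of terms.
nall = 2^npredictors - 1

@show nlimited nall
```

That works out to 2379 models with the limit and 8191 without it, so the full search fits roughly 3.4 times as many models.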

YummyPampers2 · September 7, 2021, 10:07pm

Hello Sam:

Thank you again! Might you be able to point me to some resources that align with the linear algebra component of our thread here? Or general resources that are practical and not super esoteric.

Much appreciated,

ElOceanografo · September 10, 2021, 6:05pm

I don’t have any particular recommendations; you could try Khan Academy or similar tutorials. The important concepts to start with would be vectors, matrices, matrix multiplication, and the way that the columns of a matrix are a “basis” for a multidimensional space.
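
As a small illustration of the matrix multiplication and column ideas, here is a sketch with a made-up matrix and coefficient vector. It is not the code or the error from earlier in the thread, only the general rule: `X * b` is a weighted sum of the columns of `X`, so `b` needs exactly one entry per column.

```julia
using LinearAlgebra

# A hypothetical 4×3 matrix: 4 observations (rows) of 3 variables (columns).
X = [1.0 2.0 0.0;
     0.0 1.0 1.0;
     2.0 0.0 1.0;
     1.0 1.0 1.0]
b = [0.5, -1.0, 2.0]                  # one coefficient per column of X

# X * b is the columns of X, weighted by the entries of b and added up.
y = X * b
y_by_columns = b[1] .* X[:, 1] .+ b[2] .* X[:, 2] .+ b[3] .* X[:, 3]
@show y ≈ y_by_columns

# A vector with the wrong number of entries cannot be multiplied by X.
mismatch = try
    X * [1.0, 2.0]                    # 2 entries, but X has 3 columns
    false
catch err
    err isa DimensionMismatch
end
@show mismatch
```

This is also what it means for a simulated y-variable to be a linear combination of the x-variables.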

zdenek_hurak · September 10, 2021, 7:11pm

I would recommend Introduction to Applied Linear Algebra by Stephen Boyd and Lieven Vandenberghe, which is not only nicely written but also legally available in electronic form on the authors’ website at http://vmls-book.stanford.edu/.

I would then follow with Data Driven Science and Engineering – Machine Learning, Dynamical Systems and Control by Steve Brunton and Nathan Kutz. Unfortunately, a PDF of the book is not available on the book website http://www.databookuw.com/ (well, you can always buy the book), but a whole lot of videos are linked there. In particular, a (sub)section on dimensionality reduction using SVD (including a careful and accessible intro to PCA) is on the page for Chapter 1: Singular Value Decomposition. Check it out.

EvoArt · September 10, 2021, 9:09pm

Wow, the vmls book even has a Julia companion http://vmls-book.stanford.edu/vmls-julia-companion.pdf

dlakelan · September 11, 2021, 6:04am

In addition to the PCA method, you might also look at UMAP:

https://github.com/dillondaudert/UMAP.jl

dave.f.kleinschmidt · October 8, 2021, 3:32pm

This is a great example of a use of programmatic formula construction!! Cool to see it popping up in the wild

YummyPampers2 · October 24, 2021, 8:19pm

Hi Sam,

Sorry for the extended delay. Did not get your message in my email alerts. But appreciate the suggestion. And the convo from before.
