So, a few points and questions…
- Whether to choose a good subset of variables, or to transform them with a PCA, is a choice that depends on the dataset and the goals of the analysis. I just want to emphasize that PCA is not ranking the original variables, it is finding a new set of synthetic variables (principal components) based on the correlation structure of the original ones. These synthetic variables are by construction uncorrelated, and in order of decreasing variance.
- That error is pretty self-explanatory. If you are familiar with matrix-vector multiplication, you can figure out why it’s happening. If you aren’t, it may be worth learning a bit more linear algebra before going much deeper into regression modeling or machine learning.
- In my snippet above, I was simulating an example dataset. `b` was a simulated vector of regression coefficients, which I used to create a simulated y-variable that was a linear combination of the x-variables (i.e., the assumption behind linear regression). If you already have a dataset, you don’t need to do that.
- It appears you are including the y-variable (column 14) in your `X` matrix. Is that what you want?
- I was able to get the dataset, but to repeat myself, please provide a self-contained MWE that we can copy and paste into a REPL. That means loading all the packages, downloading or generating a usable dataset, and reproducing all steps required to reproduce the error or get us up to the sticking point.

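
To make the PCA point above concrete, here is a rough sketch using only the standard library. The data are made up for illustration (three hypothetical columns, two of them correlated), and the PCA is done by hand via an SVD of the standardized data rather than with any particular PCA package, so treat it as a demonstration of the idea, not a recommended workflow.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(42)
n = 500

# Hypothetical data: columns 1 and 2 share a common signal z, column 3 is independent
z = randn(n)
X = hcat(z .+ 0.3 .* randn(n), 2 .* z .+ 0.5 .* randn(n), randn(n))

# Standardize each column so the PCA reflects the correlation structure
Xs = (X .- mean(X, dims=1)) ./ std(X, dims=1)

# PCA via SVD: columns of F.V are the loadings, Xs * F.V are the component scores
F = svd(Xs)
scores = Xs * F.V
pc_var = F.S .^ 2 ./ (n - 1)   # variance of each principal component

println("component variances: ", round.(pc_var, digits=3))
println("max |off-diagonal covariance|: ",
        maximum(abs, cov(scores) - Diagonal(pc_var)))
```

Each column of `F.V` mixes all of the original variables, which is why the components cannot be read as a ranking of the original columns. The component variances come out in decreasing order, and the covariances between components are zero up to floating-point error.
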
Modifying my previous example to use the Boston Housing dataset got me the following. I limited the number of terms per formula to 5; if you’ve got time to wait you could remove that limitation and fit all the possible models.
```julia
using CSV, HTTP, DataFrames, Combinatorics, GLM, StatsBase

# Download the Boston Housing data
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
housing = DataFrame(CSV.File(HTTP.get(url).body))

# One term per column; columns 1:13 are predictors, column 14 is the response
allterms = term.(names(housing))
# All predictor subsets with at most 5 terms
term_combis = [c for c in combinations(allterms[1:13]) if length(c) <= 5]
formulas = [allterms[14] ~ sum(c) for c in term_combis]

# Fit each model and rank by AIC (lower is better)
models = [lm(f, housing) for f in formulas]
aics = aic.(models)
comparison = DataFrame(formula = string.(formulas),
                       nterms = length.(term_combis),
                       aic = aics)
sort!(comparison, :aic)
```
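
For a sense of what the five-term cap buys you, here is a quick count of how many models each choice implies. It only uses `binomial` from Base and assumes the 13 candidate predictors from the example above.

```julia
# Number of candidate predictors (columns 1:13 in the example above)
p = 13

# Models with between 1 and 5 predictors
n_capped = sum(binomial(p, k) for k in 1:5)

# All non-empty subsets of the predictors
n_all = 2^p - 1

println("at most 5 terms: ", n_capped)   # 2379
println("all subsets:     ", n_all)      # 8191
```

So removing the cap means roughly three and a half times as many fits, and the extra models are also the larger ones.
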