Just to say, I switched to the Ames housing dataset and tried a bit more. There the regression task is to predict home prices from various variables. I dropped all categorical variables for simplicity, and this time allowed SymbolicRegression to use `*` and `/`. After playing with the parameters a bit, I got an RMS error in price of ~29k for XGBoost, ~31k for SymbolicRegression, and ~58k for SIRUS. I'm not sure what test/train split they used, but the XGBoost and SymbolicRegression numbers seem competitive with what I found in this old Kaggle contest: Ames Housing Data | Kaggle.
The formula SymbolicRegression ended up with is
(((((((YearBuilt + -0.48597169501749404) * 1.2438029869962868) + GrLivArea) - ((BsmtFinSF1 - (BedroomAbvGr - LotArea)) / (-2.0259896659909704 / 0.7342618541159212))) + GrLivArea) + TotalBsmtSF) / (3.8119719489105597 - YearBuilt))
which, if I simplify the parentheses, I believe is
((YearBuilt + -0.48597169501749404) * 1.2438029869962868 +
GrLivArea -
((BsmtFinSF1 - (BedroomAbvGr - LotArea)) / (-2.0259896659909704 / 0.7342618541159212)) +
GrLivArea +
TotalBsmtSF)
/ (3.8119719489105597 - YearBuilt)
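To double-check that flattening, the two forms can be compared numerically at random inputs. A quick stdlib-Python sketch (the notebook itself is in Julia, and the function names here are mine):

```python
import random

# The nested formula exactly as SymbolicRegression printed it.
def nested(YearBuilt, GrLivArea, BsmtFinSF1, BedroomAbvGr, LotArea, TotalBsmtSF):
    return (((((((YearBuilt + -0.48597169501749404) * 1.2438029869962868)
                + GrLivArea)
               - ((BsmtFinSF1 - (BedroomAbvGr - LotArea))
                  / (-2.0259896659909704 / 0.7342618541159212)))
              + GrLivArea) + TotalBsmtSF)
            / (3.8119719489105597 - YearBuilt))

# The hand-flattened version from above.
def flattened(YearBuilt, GrLivArea, BsmtFinSF1, BedroomAbvGr, LotArea, TotalBsmtSF):
    return (((YearBuilt + -0.48597169501749404) * 1.2438029869962868
             + GrLivArea
             - ((BsmtFinSF1 - (BedroomAbvGr - LotArea))
                / (-2.0259896659909704 / 0.7342618541159212))
             + GrLivArea
             + TotalBsmtSF)
            / (3.8119719489105597 - YearBuilt))

# Compare at random z-score-scale inputs (denominator stays away from zero
# since YearBuilt <= 3 < 3.81...).
random.seed(0)
for _ in range(1000):
    args = [random.uniform(-3, 3) for _ in range(6)]
    assert abs(nested(*args) - flattened(*args)) < 1e-12
print("formulas agree")
```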
which seems pretty nice and simple. (The variable names here refer to Z-scored transformations of the features, which was necessary for both SymbolicRegression and SIRUS to perform OK.) SIRUS ended up with these rules:
StableRules model with 20 rules:
if X[i, :x2ndFlrSF] ≥ 1.0926578 & X[i, :BsmtFinSF1] ≥ 1.5382147 then 0.028 else 0.009 +
if X[i, :x2ndFlrSF] ≥ 0.82368124 & X[i, :BsmtFinSF1] ≥ 1.5382147 then 0.037 else 0.012 +
if X[i, :TotalBsmtSF] < 1.3811567 then -0.018 else 0.423 +
if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :BsmtFinSF1] ≥ 1.5382147 then 0.042 else 0.013 +
if X[i, :BsmtFinSF1] < 1.5382147 then 0.018 else 0.466 +
if X[i, :GarageCars] < 0.31470308 then -0.028 else 0.339 +
if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :TotalBsmtSF] ≥ 1.3811567 then 0.049 else 0.016 +
if X[i, :GrLivArea] ≥ 1.4085871 & X[i, :TotalBsmtSF] ≥ 1.3811567 then 0.075 else 0.023 +
if X[i, :YrSold] ≥ 0.8789953 & X[i, :GarageCars] ≥ 0.31470308 then 0.021 else 0.006 +
if X[i, :x1stFlrSF] ≥ 1.43799 & X[i, :x2ndFlrSF] ≥ 1.4581902 then 0.029 else 0.009 +
if X[i, :GrLivArea] ≥ 1.4085871 & X[i, :BsmtFinSF1] ≥ 1.5382147 then 0.056 else 0.017 +
if X[i, :GrLivArea] < 0.84535515 then -0.024 else 0.245 +
if X[i, :YearBuilt] ≥ 1.174533 & X[i, :GrLivArea] ≥ 1.4085871 then 0.033 else 0.013 +
if X[i, :YearBuilt] ≥ 1.174533 & X[i, :GrLivArea] ≥ 0.84535515 then 0.047 else 0.014 +
if X[i, :GrLivArea] < 1.4085871 then -0.033 else 0.315 +
if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :GarageYrBlt] ≥ 0.6679906 then 0.061 else 0.015 +
if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :TotalBsmtSF] ≥ 0.9063819 then 0.057 else 0.018 +
if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :YearBuilt] ≥ 0.6748743 then 0.061 else 0.015 +
if X[i, :x2ndFlrSF] < 1.4581902 then 0.007 else 0.353 +
if X[i, :TotalBsmtSF] < 0.9063819 then -0.025 else 0.218
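To make explicit how a StableRules model like this predicts: each rule contributes its "then" value when its condition holds and its "else" value otherwise, and the twenty contributions are summed (on the transformed target scale, as far as I can tell). A minimal Python sketch of that mechanic, not SIRUS's actual implementation; I took three single-condition rules from the list above, and the input z-scores are made up:

```python
# Build a rule as a closure: condition on one z-scored feature,
# returning then_val when it holds and else_val otherwise.
def rule(feature, op, threshold, then_val, else_val):
    def apply(x):
        hit = x[feature] >= threshold if op == ">=" else x[feature] < threshold
        return then_val if hit else else_val
    return apply

# Three single-condition rules copied from the fitted model above.
rules = [
    rule("TotalBsmtSF", "<", 1.3811567, -0.018, 0.423),
    rule("GarageCars", "<", 0.31470308, -0.028, 0.339),
    rule("GrLivArea", "<", 0.84535515, -0.024, 0.245),
]

# A made-up house, as z-scores of the original features.
x = {"TotalBsmtSF": 0.2, "GarageCars": 1.0, "GrLivArea": 1.1}

# The ensemble prediction is just the sum of the rule contributions.
prediction = sum(r(x) for r in rules)
print(prediction)
```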
While XGBoost gives these feature importances for the 10 most important features:
| | feature | gain | weight | cover | total_gain | total_cover |
|---|---|---|---|---|---|---|
| 1 | "GarageCars" | 4.24146f11 | 10.0 | 557.4 | 4.24146f12 | 5574.0 |
| 2 | "GrLivArea" | 1.41052f10 | 209.0 | 287.641 | 2.94798f12 | 60117.0 |
| 3 | "HalfBath" | 6.13618f9 | 14.0 | 157.0 | 8.59065f10 | 2198.0 |
| 4 | "TotalBsmtSF" | 4.58161f9 | 186.0 | 225.973 | 8.5218f11 | 42031.0 |
| 5 | "TotRmsAbvGrd" | 3.40563f9 | 48.0 | 136.646 | 1.6347f11 | 6559.0 |
| 6 | "Fireplaces" | 3.15909f9 | 40.0 | 228.825 | 1.26364f11 | 9153.0 |
| 7 | "BsmtFinSF1" | 3.06287f9 | 162.0 | 175.531 | 4.96186f11 | 28436.0 |
| 8 | "YearBuilt" | 2.94103f9 | 202.0 | 133.02 | 5.94088f11 | 26870.0 |
| 9 | "YearRemodAdd" | 2.3613f9 | 154.0 | 158.844 | 3.6364f11 | 24462.0 |
| 10 | "FullBath" | 2.05867f9 | 15.0 | 356.6 | 3.08801f10 | 5349.0 |
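One way to read that table: gain and cover are per-split averages, so the total_* columns are just those averages multiplied by the split count (weight), i.e. total_gain = gain × weight and total_cover = cover × weight. (The f in 4.24146f11 is Julia's Float32 literal notation for 4.24146e11.) A quick sanity check against a few rows, with the values copied from the table:

```python
# (feature, gain, weight, cover, total_gain, total_cover), as printed above.
rows = [
    ("GarageCars", 4.24146e11, 10.0, 557.4, 4.24146e12, 5574.0),
    ("HalfBath", 6.13618e9, 14.0, 157.0, 8.59065e10, 2198.0),
    ("Fireplaces", 3.15909e9, 40.0, 228.825, 1.26364e11, 9153.0),
]

for name, gain, weight, cover, total_gain, total_cover in rows:
    # Loose tolerances, since the printed values are rounded to 6 digits.
    assert abs(gain * weight - total_gain) / total_gain < 1e-4
    assert abs(cover * weight - total_cover) / total_cover < 1e-4
print("totals consistent")
```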
Here is my notebook: SIRUS_Symbolic_Regression_Ames.jl (54.1 KB)