[ANN] SIRUS.jl v1.2: Interpretable Machine Learning via Rule Extraction

Just to follow up: I switched to the Ames housing dataset and experimented a bit more. There the regression task is to predict home prices from various variables. I dropped all categorical variables for simplicity, and this time allowed SymbolicRegression to use * and /. After playing with the parameters a bit, I got an RMS error in price of ~29k for XGBoost, ~31k for SymbolicRegression, and ~58k for SIRUS. I’m not sure what test/train split they used, but the XGBoost and SymbolicRegression numbers seem competitive with what I found in this old Kaggle contest: Ames Housing Data | Kaggle.
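For concreteness, the error metric I'm quoting is root-mean-square error in dollars. A minimal sketch of the computation (in Python here just for illustration; the actual notebook is Julia, and the numbers below are made up):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy illustration: predictions off by a constant $29k give an RMSE of $29k.
prices = np.array([200_000.0, 350_000.0, 125_000.0])
preds = prices + 29_000.0
print(rmse(prices, preds))  # 29000.0
```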

The formula SymbolicRegression ended up with is

(((((((YearBuilt + -0.48597169501749404) * 1.2438029869962868) + GrLivArea) - ((BsmtFinSF1 - (BedroomAbvGr - LotArea)) / (-2.0259896659909704 / 0.7342618541159212))) + GrLivArea) + TotalBsmtSF) / (3.8119719489105597 - YearBuilt))

which, if I simplify the parentheses, I believe is

((YearBuilt + -0.48597169501749404) * 1.2438029869962868 + 
 GrLivArea - 
 ((BsmtFinSF1 - (BedroomAbvGr - LotArea)) / (-2.0259896659909704 / 0.7342618541159212)) +
 GrLivArea +
 TotalBsmtSF)
/ (3.8119719489105597 - YearBuilt)

which seems pretty nice and simple (here the variable names refer to Z-scored transformations of the original columns, which were necessary for both SymbolicRegression and SIRUS to perform OK). SIRUS ended up with these rules:

StableRules model with 20 rules:
 if X[i, :x2ndFlrSF] ≥ 1.0926578 & X[i, :BsmtFinSF1] ≥ 1.5382147 then 0.028 else 0.009 +
 if X[i, :x2ndFlrSF] ≥ 0.82368124 & X[i, :BsmtFinSF1] ≥ 1.5382147 then 0.037 else 0.012 +
 if X[i, :TotalBsmtSF] < 1.3811567 then -0.018 else 0.423 +
 if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :BsmtFinSF1] ≥ 1.5382147 then 0.042 else 0.013 +
 if X[i, :BsmtFinSF1] < 1.5382147 then 0.018 else 0.466 +
 if X[i, :GarageCars] < 0.31470308 then -0.028 else 0.339 +
 if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :TotalBsmtSF] ≥ 1.3811567 then 0.049 else 0.016 +
 if X[i, :GrLivArea] ≥ 1.4085871 & X[i, :TotalBsmtSF] ≥ 1.3811567 then 0.075 else 0.023 +
 if X[i, :YrSold] ≥ 0.8789953 & X[i, :GarageCars] ≥ 0.31470308 then 0.021 else 0.006 +
 if X[i, :x1stFlrSF] ≥ 1.43799 & X[i, :x2ndFlrSF] ≥ 1.4581902 then 0.029 else 0.009 +
 if X[i, :GrLivArea] ≥ 1.4085871 & X[i, :BsmtFinSF1] ≥ 1.5382147 then 0.056 else 0.017 +
 if X[i, :GrLivArea] < 0.84535515 then -0.024 else 0.245 +
 if X[i, :YearBuilt] ≥ 1.174533 & X[i, :GrLivArea] ≥ 1.4085871 then 0.033 else 0.013 +
 if X[i, :YearBuilt] ≥ 1.174533 & X[i, :GrLivArea] ≥ 0.84535515 then 0.047 else 0.014 +
 if X[i, :GrLivArea] < 1.4085871 then -0.033 else 0.315 +
 if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :GarageYrBlt] ≥ 0.6679906 then 0.061 else 0.015 +
 if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :TotalBsmtSF] ≥ 0.9063819 then 0.057 else 0.018 +
 if X[i, :TotRmsAbvGrd] ≥ 1.5452067 & X[i, :YearBuilt] ≥ 0.6748743 then 0.061 else 0.015 +
 if X[i, :x2ndFlrSF] < 1.4581902 then 0.007 else 0.353 +
 if X[i, :TotalBsmtSF] < 0.9063819 then -0.025 else 0.218
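To make the additive structure concrete, here is a minimal re-implementation of just the first three printed rules (Python for illustration; the full model sums all 20 terms the same way, on Z-scored inputs — this is only a sketch of how the rule ensemble produces a prediction, not the SIRUS.jl API):

```python
def sirus_partial(x2ndFlrSF, BsmtFinSF1, TotalBsmtSF):
    """Sum of the first three printed rules: each rule contributes its
    'then' value when its condition holds, otherwise its 'else' value."""
    total = 0.0
    total += 0.028 if (x2ndFlrSF >= 1.0926578 and BsmtFinSF1 >= 1.5382147) else 0.009
    total += 0.037 if (x2ndFlrSF >= 0.82368124 and BsmtFinSF1 >= 1.5382147) else 0.012
    total += -0.018 if TotalBsmtSF < 1.3811567 else 0.423
    return total

# A home with average (Z-score 0) values triggers none of the "then" branches
# of the first two rules, and the "then" branch of the third:
print(round(sirus_partial(0.0, 0.0, 0.0), 3))  # 0.003  (= 0.009 + 0.012 - 0.018)
```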

While XGBoost gives these feature importances for the 10 most important features:

   feature        gain        weight   cover    total_gain  total_cover
   String         Float32     Float32  Float32  Float32     Float32
 1 "GarageCars"   4.24146f11   10.0    557.4    4.24146f12   5574.0
 2 "GrLivArea"    1.41052f10  209.0    287.641  2.94798f12  60117.0
 3 "HalfBath"     6.13618f9    14.0    157.0    8.59065f10   2198.0
 4 "TotalBsmtSF"  4.58161f9   186.0    225.973  8.5218f11   42031.0
 5 "TotRmsAbvGrd" 3.40563f9    48.0    136.646  1.6347f11    6559.0
 6 "Fireplaces"   3.15909f9    40.0    228.825  1.26364f11   9153.0
 7 "BsmtFinSF1"   3.06287f9   162.0    175.531  4.96186f11  28436.0
 8 "YearBuilt"    2.94103f9   202.0    133.02   5.94088f11  26870.0
 9 "YearRemodAdd" 2.3613f9    154.0    158.844  3.6364f11   24462.0
10 "FullBath"     2.05867f9    15.0    356.6    3.08801f10   5349.0
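As a quick sanity check on my hand-simplification of the SymbolicRegression formula above, the two forms can be compared numerically at random inputs (Python here just for the check; the constants are copied verbatim from the discovered formula):

```python
import random

def original(YearBuilt, GrLivArea, BsmtFinSF1, BedroomAbvGr, LotArea, TotalBsmtSF):
    # The nested expression exactly as SymbolicRegression printed it.
    return (((((((YearBuilt + -0.48597169501749404) * 1.2438029869962868)
                + GrLivArea)
               - ((BsmtFinSF1 - (BedroomAbvGr - LotArea))
                  / (-2.0259896659909704 / 0.7342618541159212)))
              + GrLivArea)
             + TotalBsmtSF)
            / (3.8119719489105597 - YearBuilt))

def simplified(YearBuilt, GrLivArea, BsmtFinSF1, BedroomAbvGr, LotArea, TotalBsmtSF):
    # The flattened form after removing redundant parentheses.
    return ((YearBuilt + -0.48597169501749404) * 1.2438029869962868
            + GrLivArea
            - ((BsmtFinSF1 - (BedroomAbvGr - LotArea))
               / (-2.0259896659909704 / 0.7342618541159212))
            + GrLivArea
            + TotalBsmtSF) / (3.8119719489105597 - YearBuilt)

# Z-scored inputs live roughly in [-3, 3], which keeps the denominator
# (3.812 - YearBuilt) safely away from zero.
random.seed(0)
for _ in range(1000):
    args = [random.uniform(-3, 3) for _ in range(6)]
    assert abs(original(*args) - simplified(*args)) < 1e-9
print("forms agree")
```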

Here is my notebook: SIRUS_Symbolic_Regression_Ames.jl (54.1 KB)
