Univariate feature selection

I’m having trouble finding a way to do univariate feature selection in Julia. In scikit-learn it can be done as follows:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X, y = load_iris(return_X_y=True)
X.shape                     # (150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape                 # (150, 2)

What would the equivalent in Julia look like?
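For reference, here is a minimal pure-Python sketch of what I understand `SelectKBest(chi2, k=...)` to compute: for each feature, a chi-squared statistic comparing the per-class sums of that feature with the sums expected from the class frequencies alone, after which the k highest-scoring columns are kept. The helper names `chi2_scores` and `select_k_best` and the toy data are made up for illustration, and the sketch assumes non-negative features with a positive column total. It is not scikit-learn's actual implementation, but the arithmetic should port to Julia fairly directly.

```python
from collections import defaultdict

def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature against the labels."""
    n_samples = len(X)
    n_features = len(X[0])
    # Observed counts: per-class sum of every feature column.
    observed = defaultdict(lambda: [0.0] * n_features)
    class_count = defaultdict(int)
    for row, label in zip(X, y):
        class_count[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value
    # Column totals (assumed positive, otherwise the division below fails).
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for label, count in class_count.items():
        class_prob = count / n_samples
        for j in range(n_features):
            # Expected sum if the feature were independent of the class.
            expected = class_prob * feature_total[j]
            scores[j] += (observed[label][j] - expected) ** 2 / expected
    return scores

def select_k_best(X, y, k):
    """Return the indices of the k best-scoring columns and the reduced rows."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]

# Toy data (hypothetical): columns 0 and 1 depend on the class, column 2 does not.
X_toy = [[1, 10, 5], [2, 9, 5], [9, 1, 5], [10, 2, 5]]
y_toy = [0, 0, 1, 1]
keep, X_toy_new = select_k_best(X_toy, y_toy, k=2)
print(keep)  # [0, 1]
```

Each feature is scored on its own, which is what makes the selection "univariate": no model is trained and no interaction between features is considered.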

The only thing I found referring to feature selection in Julia was the example below, but it doesn’t do what I’m looking for.

using MLJ

X = (ordinal1 = [1, 2, 3],
     ordinal2 = coerce(["x", "y", "x"], OrderedFactor),
     ordinal3 = [10.0, 20.0, 30.0],
     ordinal4 = [-20.0, -30.0, -40.0],
     nominal = coerce(["Your father", "he", "is"], Multiclass));

selector = FeatureSelector(features=[:ordinal3, ], ignore=true);

julia> transform(fit!(machine(selector, X)), X)
(ordinal1 = [1, 2, 3],
 ordinal2 = CategoricalValue{Symbol,UInt32}["x", "y", "x"],
 ordinal4 = [-20.0, -30.0, -40.0],
 nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)

Thanks @lucasmsoares96 for giving MLJ a spin.

MLJ does not currently have much in the way of canned feature selection tools. However, you can give the MLJ wrapper TunedModel any list of models. Training it picks the best model according to the specified resampling strategy (e.g., CV) and measure, and then retrains that model on all the data.
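To make the idea concrete independently of MLJ, here is a small pure-Python sketch of the same strategy: enumerate every feature subset of a given size, score each one by cross-validation, and keep the subset with the lowest error. Everything in it is an assumption for illustration: the helper names (`nearest_neighbor_predict`, `cv_error`, `best_subset`), the 1-nearest-neighbour classifier, the misclassification rate as the measure (rather than log loss), the deterministic interleaved folds, and the toy data.

```python
from itertools import combinations

def nearest_neighbor_predict(train_X, train_y, x):
    """Label of the training row closest to x (squared Euclidean distance)."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

def cv_error(X, y, columns, nfolds):
    """Misclassification rate of 1-NN on `columns`, pooled over all folds."""
    n = len(X)
    projected = [[row[j] for j in columns] for row in X]
    errors = 0
    for fold in range(nfolds):
        # Interleaved folds: row i belongs to fold i % nfolds.
        test_idx = [i for i in range(n) if i % nfolds == fold]
        train_idx = [i for i in range(n) if i % nfolds != fold]
        train_X = [projected[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if nearest_neighbor_predict(train_X, train_y, projected[i]) != y[i]:
                errors += 1
    return errors / n

def best_subset(X, y, k, nfolds=4):
    """Column subset of size k with the lowest cross-validated error."""
    candidates = combinations(range(len(X[0])), k)
    return min(candidates, key=lambda cols: cv_error(X, y, cols, nfolds))

# Toy data (hypothetical): columns 0 and 2 separate the classes,
# column 1 is large-scale noise that misleads the distance computation.
X_toy = [[0.0, 10, 1.0], [5.0, 20, 3.0], [0.1, 30, 1.1], [5.1, 40, 3.1],
         [0.2, 50, 1.2], [5.2, 60, 3.2], [0.3, 70, 1.3], [5.3, 80, 3.3]]
y_toy = [0, 1, 0, 1, 0, 1, 0, 1]
print(best_subset(X_toy, y_toy, k=2))  # (0, 2)
```

Unlike univariate scoring, this evaluates features jointly and in the context of a particular model, at the cost of training one model per subset and fold. The number of subsets grows combinatorially, so exhaustive search is only practical for a modest number of features.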

As you probably realize, FeatureSelector just cuts your table down using user-specified features.

So here’s a demo of what I think you are looking for:

using MLJ
using Combinatorics
using Tables

X, y = @load_iris # table, vector
KNN = @iload KNNClassifier
knn = KNN()

features = Tables.columnnames(X)
selections = combinations(features, 2)

# One pipeline per pair of features: select the pair, then fit KNN
models = map(selections) do s
    FeatureSelector(features=s) |> knn
end

tmodel = TunedModel(models=models, resampling=CV(nfolds=4, rng=123), measure=log_loss)

# Training `tmodel` means choosing the model in `models` with the best cv score, and then
# retraining best model on all data

mach = machine(tmodel, X, y) |> fit!
predict(mach, X)[1:2] # this prediction is based on the best model, trained on all data

# You can also inspect the best model:
r = report(mach).best_model

julia> r.feature_selector.features
2-element Vector{Symbol}:

P.S. Maybe you want to change the title of your post; I had to scratch my head a bit. Maybe “CV-based feature selection” would be more informative?