Univariate feature selection

lucasmsoares96 · September 17, 2022, 6:22pm

I’m having trouble finding a way to do univariate feature selection in Julia. In scikit-learn this is possible as follows:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X, y = load_iris(return_X_y=True)
X.shape                     # (150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape                 # (150, 2)

What would the equivalent in Julia look like?
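For context, `SelectKBest(chi2, k=2)` scores each feature on its own against the labels and keeps the two highest-scoring columns. Below is a minimal pure-Python sketch of that idea, assuming non-negative feature values with a positive column total. The helper names (`chi2_scores`, `select_k_best`) and the toy data are illustrative assumptions, not scikit-learn's implementation.

```python
# Sketch of chi-squared univariate scoring followed by top-k selection.
# Helper names and toy data are hypothetical, chosen for illustration only.

def chi2_scores(X, y):
    """Return one chi-squared score per column of X (non-negative values)."""
    classes = sorted(set(y))
    n_samples = len(X)
    n_features = len(X[0])
    # Total of each feature over all samples (assumed to be positive).
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            # Observed: sum of feature j over the samples in class c.
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected: what that sum would be if feature j ignored the class.
            class_freq = sum(1 for label in y if label == c) / n_samples
            expected = class_freq * feature_totals[j]
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_k_best(X, y, k):
    """Keep the k highest-scoring columns, preserving their original order."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:k])
    return [[row[j] for j in kept] for row in X], kept


# Toy data: column 1 is constant, so it carries no class information.
X = [[1, 5, 0], [2, 5, 0], [8, 5, 1], [9, 5, 1]]
y = ["a", "a", "b", "b"]

scores = chi2_scores(X, y)
X_new, kept = select_k_best(X, y, k=2)
print(kept)   # indices of the retained columns
print(X_new)  # the reduced table
```

In this toy case the constant column scores 0 and is the one dropped. Each feature is scored independently of the others, which is what makes the method univariate.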

The only thing I found referring to feature selection in Julia was the example below, but it doesn’t do what I’m looking for.

using MLJ

X = (ordinal1 = [1, 2, 3],
     ordinal2 = coerce(["x", "y", "x"], OrderedFactor),
     ordinal3 = [10.0, 20.0, 30.0],
     ordinal4 = [-20.0, -30.0, -40.0],
     nominal = coerce(["Your father", "he", "is"], Multiclass));

selector = FeatureSelector(features=[:ordinal3, ], ignore=true);

julia> transform(fit!(machine(selector, X)), X)
(ordinal1 = [1, 2, 3],
 ordinal2 = CategoricalValue{String,UInt32}["x", "y", "x"],
 ordinal4 = [-20.0, -30.0, -40.0],
 nominal = CategoricalValue{String,UInt32}["Your father", "he", "is"],)

ablaom · September 18, 2022, 9:07pm

Thanks @lucasmsoares96 for giving MLJ a spin.

MLJ does not currently have much in the way of canned feature selection tools. However, you can give the MLJ wrapper TunedModel an explicit list of models; training it picks the best model according to a specified resampling strategy (e.g., CV) and measure, and then retrains that model on all the data.

As you probably realize, FeatureSelector just cuts your table down using user-specified features.
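To make the behaviour of `FeatureSelector` concrete, here is a rough Python sketch of the same idea on a dictionary of columns. `select_features` is a hypothetical helper, not part of MLJ; it only mimics the `features` and `ignore` options as they are used in the example earlier in the thread.

```python
# Sketch: keep (or drop) user-specified columns of a column table.
# `select_features` is a hypothetical stand-in, not MLJ's implementation.

def select_features(table, features, ignore=False):
    """Keep the listed columns, or drop them when ignore is True."""
    listed = set(features)
    return {name: col for name, col in table.items()
            if (name in listed) != ignore}


table = {
    "ordinal1": [1, 2, 3],
    "ordinal2": ["x", "y", "x"],
    "ordinal3": [10.0, 20.0, 30.0],
    "ordinal4": [-20.0, -30.0, -40.0],
    "nominal": ["Your father", "he", "is"],
}

# ignore=True drops the listed column and keeps everything else.
reduced = select_features(table, ["ordinal3"], ignore=True)
# ignore=False (the default) keeps only the listed column.
only_listed = select_features(table, ["ordinal3"])
print(list(reduced))
print(list(only_listed))
```

Nothing here looks at the data or the target: the choice of columns is made entirely by the user, which is why this alone does not perform feature selection in the statistical sense.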

So here’s a demo of what I think you are looking for:

using MLJ
using Combinatorics
using Tables

X, y = @load_iris # table, vector
KNN = @iload KNNClassifier
knn = KNN()

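features = Tables.columnnames(X)        # the four iris feature names
selections = combinations(features, 2)  # every pair of features (6 in total)

# One pipeline per pair: select the two features, then classify with KNN
models = map(selections) do s
    FeatureSelector(features=s) |> knn
end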

tmodel = TunedModel(models=models, resampling=CV(nfolds=4, rng=123), measure=log_loss)

# Training `tmodel` means choosing the model in `models` with the best CV score, and then
# retraining that best model on all the data

mach = machine(tmodel, X, y)
fit!(mach)
predict(mach, X)[1:2] # predictions from the best model, trained on all data

# You can also inspect the best model:
r = report(mach).best_model

julia> r.feature_selector.features
2-element Vector{Symbol}:
 :petal_length
 :petal_width
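The logic of the demo above (score every pair of features by cross-validation, keep the best pair) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not what MLJ does internally: it uses a 1-nearest-neighbour classifier, misclassification rate instead of `log_loss`, and unshuffled folds assigned by row index. All function names and the toy data are hypothetical.

```python
# Sketch: exhaustive CV-based selection of a feature subset of fixed size.
# Classifier, measure, fold scheme, names and data are all assumptions.
from itertools import combinations


def nearest_neighbor_predict(train_X, train_y, x):
    """Label of the training row closest to x (squared Euclidean distance)."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]


def cv_error(X, y, cols, nfolds):
    """Misclassification rate of 1-NN restricted to `cols`, over nfolds folds."""
    Xs = [[row[j] for j in cols] for row in X]
    n = len(Xs)
    errors = 0
    for fold in range(nfolds):
        # Row i belongs to the test set of fold (i mod nfolds).
        train_idx = [i for i in range(n) if i % nfolds != fold]
        test_idx = [i for i in range(n) if i % nfolds == fold]
        train_X = [Xs[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if nearest_neighbor_predict(train_X, train_y, Xs[i]) != y[i]:
                errors += 1
    return errors / n


def best_subset(X, y, size, nfolds):
    """Return (cv error, column indices) for the best subset of given size."""
    n_features = len(X[0])
    scored = [(cv_error(X, y, cols, nfolds), cols)
              for cols in combinations(range(n_features), size)]
    return min(scored)  # ties are broken by column order


# Toy data: columns 0 and 2 separate the classes; column 1 is large-scale noise.
X = [[0.0, 0, 0.1], [1.0, 0, 0.9], [0.1, 10, 0.0], [0.9, 10, 1.0],
     [0.0, 20, 0.2], [1.0, 20, 0.8], [0.2, 30, 0.1], [0.8, 30, 1.0]]
y = ["a", "b", "a", "b", "a", "b", "a", "b"]

best_err, best_cols = best_subset(X, y, size=2, nfolds=4)
print(best_cols, best_err)
```

Unlike univariate scoring, this evaluates features jointly through a model, and the number of candidate subsets grows combinatorially with the number of features, so it is only practical for small feature sets like iris.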

P.S. Maybe you want to change the title of your post. I had to scratch my head a bit. Maybe CV-based feature selection would be more informative??
