Feature selection + classification pipeline

Hi,
I am relatively new to Julia.
I find it very difficult to build a pipeline that selects an optimal subset of features from a large number of features and then trains an ML classifier.
Currently, I first apply ShapML to rank features by importance, then use TunedModel to find the optimal number of ranked features to include in the model, then train the ML model, but this seems very suboptimal. I tried ReliefF but was not able to run it due to a dependency issue. Unfortunately, other common feature selection algorithms are not available in Julia (e.g. recursive feature elimination).
I was wondering if a pipeline could be designed and trained where some features are randomly chosen, then given to ML in order to provide optimal subset.

I would be happy to share code if needed

Thanks in advance

Adding support for recursive feature elimination, and other feature selection tools, is currently high priority in MLJ.
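In the meantime, a crude recursive feature elimination loop can be hand-rolled for models that report feature importances. This is only a sketch, not MLJ's eventual API: it assumes your data lives in `X`, `y`, that DecisionTree.jl is installed, and that the chosen model supports `feature_importances` (tree models do).

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# Crude RFE: repeatedly drop the least important feature.
features = collect(schema(X).names)
while length(features) > 1
    Xsub = selectcols(X, features)
    mach = machine(Tree(), Xsub, y) |> fit!
    imps = feature_importances(mach)             # feature => importance pairs
    worst = last(sort(imps, by=last, rev=true))  # least important feature
    filter!(f -> f != first(worst), features)
    # here one would evaluate the reduced model and keep the best subset
end
```

Each pass refits on one fewer column; to turn this into a proper selector you would score each subset (e.g. with `evaluate`) and keep the best-scoring one rather than running to exhaustion.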

I was wondering if a pipeline could be designed and trained where some features are randomly chosen, then given to ML in order to provide optimal subset.

One approach would be to use TunedModel(models=..., ) (see here) and manually generate the models as mutations of the pipeline model pipe = FeatureSelector() |> your_model (mutating pipe.feature_selector.features). I can write out an example later next week if you get stuck. With a bit more work, one could translate this into a general, re-usable model wrapper using learning networks, but that would be a bit more advanced.
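The suggestion above can be sketched as follows. This is a hedged illustration, not a definitive recipe: it assumes `X` and `y` are your table and target, and uses DecisionTreeClassifier as a stand-in for `your_model`; the subset size (3) and number of candidates (30) are arbitrary.

```julia
using MLJ
using StatsBase: sample

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

feature_names = collect(schema(X).names)

# Base pipeline: feature selector followed by the classifier
pipe = FeatureSelector(features=feature_names) |> Tree()

# Generate mutations of the pipeline, one per random feature subset
candidates = map(1:30) do _
    p = deepcopy(pipe)
    p.feature_selector.features = sample(feature_names, 3, replace=false)
    p
end

# TunedModel with an explicit list of models: "tuning" = picking the
# best candidate pipeline by cross-validated log loss
tuned = TunedModel(models=candidates, resampling=CV(nfolds=5), measure=log_loss)
mach = machine(tuned, X, y) |> fit!
report(mach).best_model.feature_selector.features
```

The last line reads back which feature subset won, since each candidate differs only in `feature_selector.features`.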


thanks, ablaom, you gave me a lot of ideas to play with! I'll let you know how it goes

best

here we are! It works perfectly. Now I'll try to do things sequentially, e.g. first find the best model with one feature, then add a feature in the next iteration.
thanks again



using MLJ
using StatsBase: sample

feature_names = collect(schema(X).names)

model_names   = String[]
package_names = String[]
modelsxx      = []

numfeatures   = 5   # maximum size of a feature subset
nummontecarlo = 50  # random subsets drawn per subset size

# Classifiers to skip (unsupported or problematic in this setup)
excluded = ["LinearBinaryClassifier", "LDA", "GaussianNBClassifier", "BayesianLDA",
            "DSADDetector", "ESADDetector", "GaussianProcessClassifier",
            "PerceptronClassifier", "RidgeCVClassifier", "RidgeClassifier",
            "SVC", "SVMClassifier", "SVMLinearClassifier", "SVMNuClassifier",
            "SGDClassifier", "PassiveAggressiveClassifier", "NuSVC", "LinearSVC",
            "DeterministicConstantClassifier"]

for m in [models(matching(X, y))[1:20]; (name="NeuralNetworkClassifier", package_name="MLJFlux")]
    m.name in excluded && continue
    println(m.name)
    for j in 1:numfeatures
        println(j)
        for _ in 1:nummontecarlo
            feature_namesred = sample(feature_names, j, replace=false)
            model_name   = m.name
            package_name = m.package_name
            # @load needs literal names, hence the eval
            eval(:(clf = @load $model_name pkg=$package_name verbosity=0))
            moddd = FeatureSelector(features=feature_namesred) |> clf()
            push!(modelsxx, moddd)
            push!(model_names, m.name)
            push!(package_names, m.package_name)
        end
    end
end
multi_model = TunedModel(models=modelsxx,
                         resampling=CV(nfolds=5),
                         measure=log_loss,
                         # acceleration=CPUThreads(),
                         check_measure=true)

e = evaluate(multi_model, X, y,
             resampling=CV(nfolds=2),
             measure=[auc, accuracy],
             verbosity=0)

println(e.report_per_fold[1].best_model)
println(e.per_fold)
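The sequential idea mentioned above (start with one feature, then add one per iteration) could be sketched as greedy forward selection. A hedged sketch, again assuming `X`, `y` and using DecisionTreeClassifier as a placeholder classifier:

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

remaining = collect(schema(X).names)
selected  = Symbol[]
best_loss = Inf

# Greedy forward selection: at each step, add the single feature
# that most improves cross-validated log loss; stop when none helps.
while !isempty(remaining)
    losses = map(remaining) do f
        pipe = FeatureSelector(features=[selected; f]) |> Tree()
        evaluate(pipe, X, y, resampling=CV(nfolds=5),
                 measure=log_loss, verbosity=0).measurement[1]
    end
    loss, i = findmin(losses)
    loss >= best_loss && break
    best_loss = loss
    push!(selected, popat!(remaining, i))
end
println(selected)
```

This evaluates `length(remaining)` pipelines per round, so it is quadratic in the number of features overall; for very wide tables you would cap the number of rounds.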