Hi,
I am relatively new to Julia.
I am finding it very difficult to build a pipeline that selects an optimal subset of features from a large set of features and then trains an ML classifier.
Currently, I first apply ShapML to rank features by importance, then use TunedModel to find the optimal number of ranked features to include in the model, and then train the classifier, but this seems very suboptimal. I tried ReliefF but was not able to run it due to a dependency issue. Unfortunately, other common feature selection algorithms (e.g. recursive feature elimination) are not available in Julia.
I was wondering whether a pipeline could be designed and trained in which subsets of features are chosen at random and then passed to the ML model, in order to find an optimal subset.
I would be happy to share code if needed.
Thanks in advance!
Adding support for recursive feature elimination, and other feature selection tools, is currently high priority in MLJ.
I was wondering whether a pipeline could be designed and trained in which subsets of features are chosen at random and then passed to the ML model, in order to find an optimal subset.
One approach would be to use TunedModel(models=..., ) (see here) and manually generate the models as mutations of the pipeline model pipe = FeatureSelector |> your_model (mutating pipe.feature_selector.features). I can write out an example later next week if you get stuck. With a bit more work one could translate this into a general, re-usable model wrapper using learning networks, but that would be a bit more advanced.
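Roughly, a minimal sketch of that idea (untested; it assumes a table X, a target vector y, and, purely as a placeholder classifier, DecisionTreeClassifier from DecisionTree.jl — the 20 candidates and 3 features per candidate are arbitrary):

using MLJ, StatsBase

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
pipe = FeatureSelector() |> Tree()              # FeatureSelector |> your_model
all_features = schema(X).names |> collect

# manually generate candidate models as mutations of pipe.feature_selector.features
candidates = map(1:20) do _
    m = deepcopy(pipe)
    m.feature_selector.features = sample(all_features, 3, replace=false)  # 3 random features
    m
end

tuned = TunedModel(models=candidates, resampling=CV(nfolds=5), measure=log_loss)
mach = machine(tuned, X, y)
fit!(mach, verbosity=0)
report(mach).best_model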
Thanks, ablaom, you gave me a lot of ideas to play with! I'll let you know how it goes.
Best
Here we are! It works perfectly. Now I'll try to do things sequentially, e.g. first find the best model with one feature, then add a feature in the next iteration (see the sketch after the code below).
Thanks again
using MLJ, StatsBase

feature_names = schema(X).names |> collect
model_names = String[]
package_names = String[]
modelsxx = []
numfeatures = 5
nummontecarlo = 50

# classifiers to skip (errored or were otherwise unsuitable in my setup)
excluded = ["LinearBinaryClassifier", "LDA", "GaussianNBClassifier", "BayesianLDA",
            "DSADDetector", "ESADDetector", "GaussianProcessClassifier",
            "PerceptronClassifier", "RidgeCVClassifier", "RidgeClassifier", "SVC",
            "SVMClassifier", "SVMLinearClassifier", "SVMNuClassifier", "SGDClassifier",
            "PassiveAggressiveClassifier", "NuSVC", "LinearSVC",
            "DeterministicConstantClassifier"]

for m in [models(matching(X, y))[1:20]; (name="NeuralNetworkClassifier", package_name="MLJFlux")]
    m.name in excluded && continue
    println(m.name)
    for j in 1:numfeatures
        println(j)
        for row in 1:nummontecarlo
            # draw j feature names at random, without replacement
            feature_namesred = sample(feature_names, j, replace=false)
            model_name = m.name
            package_name = m.package_name
            eval(:((clf = @load $model_name pkg=$package_name verbosity=0)))
            moddd = MLJ.FeatureSelector(features=feature_namesred) |> clf
            push!(modelsxx, moddd)
            push!(model_names, m.name)
            push!(package_names, m.package_name)
        end
    end
end
multi_model = TunedModel(models=modelsxx,
                         resampling=CV(nfolds=5),
                         measure=log_loss,  # acceleration=CPUThreads(),
                         check_measure=true)

e = evaluate(multi_model, X, y,
             resampling=CV(nfolds=2),
             measure=[auc, accuracy],
             verbosity=0)
println(e.report_per_fold[1].best_model)
println(e.per_fold)
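For the sequential idea mentioned above, a greedy forward-selection loop might look roughly like this (only a sketch, not tested; it assumes clf is a single classifier type already loaded with @load, and reuses X, y and numfeatures from the code above):

using MLJ, StatsBase

selected  = Symbol[]
remaining = schema(X).names |> collect

for step in 1:numfeatures
    # score each remaining feature when added to the ones already selected
    scores = map(remaining) do f
        pipe = MLJ.FeatureSelector(features=[selected; f]) |> clf
        evaluate(pipe, X, y, resampling=CV(nfolds=5),
                 measure=log_loss, verbosity=0).measurement[1]
    end
    best = argmin(scores)
    push!(selected, remaining[best])
    deleteat!(remaining, best)
    println("step $step: selected = $selected, log_loss = $(scores[best])")
end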