Hi,
I am relatively new to Julia.
I am finding it very difficult to build a pipeline that selects an optimal subset of features from a large set of features and then trains an ML classifier.
Currently, I first apply ShapML to rank features by importance, then use TunedModel to find the optimal number of ranked features to include in the model, and then train the classifier, but this seems very suboptimal. I tried ReliefF but was not able to run it due to a dependency issue. Unfortunately, other common feature selection algorithms (e.g. recursive feature elimination) are not available in Julia.
I was wondering whether a pipeline could be designed and trained in which subsets of features are chosen at random and then passed to the ML model, in order to find an optimal subset.
I would be happy to share code if needed.
Thanks in advance!
Adding support for recursive feature elimination, and other feature selection tools, is currently high priority in MLJ.
I was wondering whether a pipeline could be designed and trained in which subsets of features are chosen at random and then passed to the ML model, in order to find an optimal subset.
One approach would be to use TunedModel(models=..., ) (see here) and manually generate the models as mutations of the pipeline model pipe = FeatureSelector |> your_model (mutating pipe.feature_selector.features). I can write out an example later next week if you get stuck. With a bit more work one could translate this into a general, re-usable model wrapper using learning networks, but that would be a bit more advanced.
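Roughly, a minimal sketch of that idea (untested; it assumes a table X, a target vector y, and, purely as a placeholder classifier, DecisionTreeClassifier from DecisionTree.jl — the 20 candidates and 3 features per candidate are arbitrary):

using MLJ, StatsBase

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
pipe = FeatureSelector() |> Tree()              # FeatureSelector |> your_model
all_features = schema(X).names |> collect

# manually generate candidate models as mutations of pipe.feature_selector.features
candidates = map(1:20) do _
    m = deepcopy(pipe)
    m.feature_selector.features = sample(all_features, 3, replace=false)  # 3 random features
    m
end

tuned = TunedModel(models=candidates, resampling=CV(nfolds=5), measure=log_loss)
mach = machine(tuned, X, y)
fit!(mach, verbosity=0)
report(mach).best_model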
Thanks, ablaom, you gave me a lot of ideas to play with! I'll let you know how it goes.
Best
Here we are! It works perfectly. Now I'll try to do things sequentially, e.g. first find the best model with one feature, then add a feature in the next iteration (see the sketch after the code below).
Thanks again
using MLJ, StatsBase

feature_names = schema(X).names |> collect
model_names = String[]
package_names = String[]
modelsxx = []
numfeatures = 5
nummontecarlo = 50

# classifiers to skip (errored or were otherwise unsuitable in my setup)
excluded = ["LinearBinaryClassifier", "LDA", "GaussianNBClassifier", "BayesianLDA",
            "DSADDetector", "ESADDetector", "GaussianProcessClassifier",
            "PerceptronClassifier", "RidgeCVClassifier", "RidgeClassifier", "SVC",
            "SVMClassifier", "SVMLinearClassifier", "SVMNuClassifier", "SGDClassifier",
            "PassiveAggressiveClassifier", "NuSVC", "LinearSVC",
            "DeterministicConstantClassifier"]

for m in [models(matching(X, y))[1:20]; (name="NeuralNetworkClassifier", package_name="MLJFlux")]
    m.name in excluded && continue
    println(m.name)
    for j in 1:numfeatures
        println(j)
        for row in 1:nummontecarlo
            # draw j feature names at random, without replacement
            feature_namesred = sample(feature_names, j, replace=false)
            model_name = m.name
            package_name = m.package_name
            eval(:((clf = @load $model_name pkg=$package_name verbosity=0)))
            moddd = MLJ.FeatureSelector(features=feature_namesred) |> clf
            push!(modelsxx, moddd)
            push!(model_names, m.name)
            push!(package_names, m.package_name)
        end
    end
end
multi_model = TunedModel(models=modelsxx,
                         resampling=CV(nfolds=5),
                         measure=log_loss,  # acceleration=CPUThreads(),
                         check_measure=true)

e = evaluate(multi_model, X, y,
             resampling=CV(nfolds=2),
             measure=[auc, accuracy],
             verbosity=0)
println(e.report_per_fold[1].best_model)
println(e.per_fold)
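For the sequential idea mentioned above, a greedy forward-selection loop might look roughly like this (only a sketch, not tested; it assumes clf is a single classifier type already loaded with @load, and reuses X, y and numfeatures from the code above):

using MLJ, StatsBase

selected  = Symbol[]
remaining = schema(X).names |> collect

for step in 1:numfeatures
    # score each remaining feature when added to the ones already selected
    scores = map(remaining) do f
        pipe = MLJ.FeatureSelector(features=[selected; f]) |> clf
        evaluate(pipe, X, y, resampling=CV(nfolds=5),
                 measure=log_loss, verbosity=0).measurement[1]
    end
    best = argmin(scores)
    push!(selected, remaining[best])
    deleteat!(remaining, best)
    println("step $step: selected = $selected, log_loss = $(scores[best])")
end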