Automatic Creation of a Grid of Tuning Parameters

Currently, in R’s caret package if we do something like this:

library(caret)

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
model_rf <- train(responder ~ ., data = trainData, method = "rf", trControl = fitControl)

a grid of tuning parameters is automatically created.
Can something like this be done using MLJ.jl?

Sure, MLJ provides a range of tuning strategies, including Grid. For repeated (aka Monte Carlo) cross-validation, give TunedModel the options resampling=CV(nfolds=10, rng=123) and repeats=5. For other options, query the TunedModel docstring.

Note that in MLJ tuning is implemented as a model wrapper, as in MLR/MLR3. The wrapped model can be viewed as a “self-tuning” version of the original model. Under the hood, the provided resampling strategy (e.g., CV) is applied to determine the optimal hyperparameter(s), and the atomic model is then retrained on all the data using those values.
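For concreteness, here is a minimal sketch of this wrapper pattern, using the built-in iris data and the DecisionTree.jl classifier (both chosen here purely for illustration, not taken from the question):

using MLJ

X, y = @load_iris  # toy data, for illustration only
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree()

# grid over a single hyperparameter, with repeated 10-fold cross-validation:
r = range(tree, :max_depth, lower=1, upper=5)
self_tuning_tree = TunedModel(model=tree,
                              tuning=Grid(),
                              resampling=CV(nfolds=10, rng=123),
                              repeats=5,
                              range=r,
                              measure=log_loss)

mach = machine(self_tuning_tree, X, y)
fit!(mach)  # tunes max_depth, then retrains the best model on all the data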

Tuning docs
A tuning tutorial
Another tuning tutorial
One of several end-to-end examples with tuning

Thank you for your response, @ablaom.

I should have stated my problem more clearly. What I am looking for is the following:

In R, as mentioned in the original question, a grid of parameter values is created automatically by running those two lines of code. To be more specific, for random forest classification, caret automatically creates a grid of values for mtry (the equivalent of n_subfeatures in MLJ).

In MLJ, if I want to tune a RandomForest, I will have to do something like:

using MLJ

RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree
rf_model = RandomForestClassifier()

# the range of values to try must be supplied by hand:
range_rf = range(rf_model, :n_subfeatures, values=[2, 6, 10])
self_tuning_rf = TunedModel(model=rf_model,
                            resampling=CV(nfolds=10),
                            repeats=5,
                            tuning=Grid(),
                            range=range_rf,
                            measure=[accuracy, kappa])

rf = machine(self_tuning_rf, X, y)
MLJ.fit!(rf, rows=train)
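For reference, after a successful fit the outcome can be inspected along these lines (my reading of the TunedModel report; field names may differ across versions):

report(rf).best_model           # the atomic model with the winning n_subfeatures
report(rf).best_history_entry   # its estimated accuracy and kappa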

The problem here is that the TunedModel constructor expects a range. If I run it without one, I get the following error:

julia> self_tuning_rf = TunedModel(model=rf_model,
                                   resampling=CV(nfolds=10),
                                   repeats=5,
                                   tuning=Grid(),
                                   measure=accuracy)
ERROR: LoadError: ArgumentError: You need to specify `range=...`, unless `tuning=Explicit` and `models=...` is specified instead.
Stacktrace:
...

All in all, what I want is some sort of implementation where I can call TunedModel without passing anything to the range argument, and it automatically chooses one or more parameters to tune depending on the model (the way caret chooses mtry for a random forest and cp for a decision tree) and creates a grid based on the type of problem (e.g., probabilistic classification) and on the dataset passed (number of features, number of rows, data schema, etc.), just as caret does. I hope that is clear.

Ah, thanks for clarifying. Yes, my understanding is that caret stores default ranges for each hyperparameter in its model metadata. MLJ does not yet provide this cool feature. In the meantime, a rough workaround is to derive a default grid from the data yourself, as sketched below.
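For example, the following helper (an ad hoc suggestion, not part of MLJ) spaces a few integer values between 2 and the number of features, roughly mimicking caret's default mtry grid:

using MLJ

# hypothetical helper, not an MLJ feature: a caret-style default grid
# for n_subfeatures; assumes the data has at least two features
function default_n_subfeatures_grid(X; len=3)
    p = length(schema(X).names)  # number of features
    return unique(round.(Int, range(2, p, length=len)))
end

range_rf = range(rf_model, :n_subfeatures, values=default_n_subfeatures_grid(X))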

We had thought about this, but my current inclination would be to provide default prior probability distributions instead, as I think RandomSearch is a better all-purpose strategy than Grid. There is an OpenML project that has been determining good default priors for popular models by “learning” these priors from a battery of OpenML datasets. Do you have thoughts on this suggestion?
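For instance, tuning the random forest above with RandomSearch might look like this (a sketch only; by default a bounded numeric range is sampled uniformly, and custom priors can be attached as described in the RandomSearch docstring):

r = range(rf_model, :n_subfeatures, lower=1, upper=20)
self_tuning_rf = TunedModel(model=rf_model,
                            tuning=RandomSearch(rng=123),
                            resampling=CV(nfolds=10),
                            repeats=5,
                            range=r,
                            n=25,        # number of models to sample
                            measure=accuracy)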

I agree. Also, I think we should incorporate some information about the training dataset, such as the number of features and the type of prediction problem, into the priors.