That’s true, I guess I’ve been arguing from the point of view of my counter-proposal near the top of this thread. In that counter-proposal I propose that `fit` return an actual type that is documented with a docstring. It seems cleaner to me to return a single object rather than the tuple `(fitted_params, state, report)`.
Your proposal of returning a `FitResult{RandomForestRegressor}` is similar in spirit, although as far as I know it’s not possible to attach a docstring to a specific instantiation of a parametric type. On the other hand, maybe I’m thinking too much like an R programmer. In R, object fields are often a documented part of the API, whereas we usually don’t do that in Julia. So, the creator of a random forest library could add a method like `trees(::FitResult{RandomForestRegressor})` to extract the ensemble of fitted decision trees, and the `trees` method would of course have a docstring.
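A minimal sketch of that pattern might look like the following. The type and field names here (`FitResult`, `data`, the placeholder `RandomForestRegressor`) are assumptions for illustration, not part of any actual package:

```julia
# Hypothetical algorithm marker type, for illustration only.
struct RandomForestRegressor end

# Hypothetical parametric result type returned by `fit`.
struct FitResult{A}
    algorithm::A
    data::Dict{Symbol,Any}
end

"""
    trees(result::FitResult{RandomForestRegressor})

Return the ensemble of fitted decision trees stored in `result`.
"""
trees(result::FitResult{RandomForestRegressor}) = result.data[:trees]
```

The docstring attaches to the method, not to the parametric type instantiation, which sidesteps the limitation mentioned above.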
However, as a package developer, I want to have more control over my input and output types. I don’t want to be forced to subtype `LearnAPI.Algorithm`. To take an example from the ecosystem, the supertype of `DataFrame` is `AbstractDataFrame`, not `Table`. In fact, there is no `Tables.Table` type. Furthermore, I should be able to return whatever type I want from `fit`. I don’t think we should use tuple output types except for the rare cases where it makes intuitive sense, like `min, max = extrema(x)` or `d, r = divrem(x, y)`. If MLJ needs a `state` object somewhere, then the API for that should look something like the following:
```julia
model = fit(options, X, y)
s = state(model)
```
LearnAPI could provide a default `state` method like this:
```julia
struct Stateless end
state(::Any) = Stateless()
```
This way most custom model types don’t even need to worry about the `state` method.
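To make the fallback concrete, here is a sketch of how it would behave. `PlainModel` and `IterativeModel` are hypothetical model types invented for illustration:

```julia
# The proposed default: models are stateless unless they opt in.
struct Stateless end
state(::Any) = Stateless()

# A model with no `state` method; the fallback applies.
struct PlainModel end

# A hypothetical iterative model that does carry state and opts in
# by defining its own `state` method.
struct IterativeModel
    coefs::Vector{Float64}
    optimizer_state::Vector{Float64}  # e.g. momentum buffers
end
state(m::IterativeModel) = m.optimizer_state
```

Only models that genuinely have resumable state need to define a method; everything else inherits the `Stateless` fallback for free.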
Additionally, the `report` concept need not enter into LearnAPI. It’s too ill-defined a concept. Instead, custom model types can implement whichever specific inspection methods make sense for their fitted model type, e.g.
```julia
options = RandomForestRegressorOptions()
model = fit(options, X, y)

# Inspect the learned ensemble of decision trees:
trees(model)

# Inspect the out-of-bag predictions:
oob_predictions(model)

# Inspect the max tree depth hyperparameter:
max_tree_depth(model)
```
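For concreteness, here is one way a package might wire this up end to end. Every name here (`RandomForestRegressorOptions`, `RandomForestRegressorModel`, the accessor methods) is a hypothetical sketch of the pattern, not an actual interface:

```julia
# Hypothetical options struct holding hyperparameters.
struct RandomForestRegressorOptions
    max_tree_depth::Int
end
RandomForestRegressorOptions(; max_tree_depth = 10) =
    RandomForestRegressorOptions(max_tree_depth)

# Hypothetical fitted-model struct; the package chooses its own type,
# with no mandatory supertype.
struct RandomForestRegressorModel
    options::RandomForestRegressorOptions
    trees::Vector{Any}
    oob_predictions::Vector{Float64}
end

function fit(options::RandomForestRegressorOptions, X, y)
    # A real implementation would grow trees here; this sketch just
    # stores placeholders of the right shape.
    RandomForestRegressorModel(options, Any[], zeros(length(y)))
end

# Specific, documented inspection methods instead of a generic `report`:
trees(m::RandomForestRegressorModel) = m.trees
oob_predictions(m::RandomForestRegressorModel) = m.oob_predictions
max_tree_depth(m::RandomForestRegressorModel) = m.options.max_tree_depth
```

Each accessor can carry its own docstring, so the “report” contents become a discoverable, documented part of the package’s API rather than an untyped grab bag.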
My broader concern with LearnAPI is that it is targeted at package developers. I believe we need a model interface that is easy to implement for both package developers and end users. In my day job as a data scientist, I use scikit-learn. I am an end user. I am developing application code, not library code. Yet I still regularly implement my own custom scikit-learn model types—and it is very simple and easy to do. It seems to me that implementing custom model types in LearnAPI (and the similar MLJModelInterface) is more complicated and less intuitive than it is in scikit-learn. It’s not friendly to end users.