[ANN] LearnAPI.jl - Proposal for a basement-level machine learning API

I think LearnAPI does mostly avoid specifying any type restrictions. The only exception I’m aware of is the subtyping of LearnAPI.Algorithm, but it looks like that is optional:

The subtyping MyRidge <: LearnAPI.Algorithm is optional but recommended where it is not otherwise disruptive.

Not sure why it should be recommended at all…

This discussion is really great, and I just want to applaud the efforts of everyone (especially @ablaom who will actually have to instantiate it). I was going to make a quip about bike shedding, but honestly I feel like the whole discussion has been really informative, and is actually important.

I will refrain from sharing my opinion, as I am way less informed than most that are commenting. But I wanted to suggest that, whatever name is chosen, some summary of this discussion gets memorialized in the docs, if for no other reason than discoverability. If I come looking for something that I think is called a “model” based on Wikipedia, it would be good to be able to read about what LearnAPI thinks a model is, and where to find what I actually need.


Some more word ideas – maybe one of them clicks.

framework
scheme
approach
technique
recipe
blueprint
plan
platform

FWIW, SAS JMP uses “platform”.

As the field is called “Machine Learning”, I believe “Machine” or “Learner” will always be the most discoverable and intuitive, which is more important than strict correctness IMO.

I think I derailed this thread with bikeshedding. Package developers can name their types whatever they want to. It would be nice to have some conventions, like the *Classifier and *Regressor convention, but it’s not essential. And I think the question that @ablaom was asking had more to do with how to name these concepts in the LearnAPI documentation, rather than anything to do with names of types.

There are still plenty of substantive discussions to be had around the API design. I would like for LearnAPI to be generic enough to be used across the ecosystem, not just by MLJ. To be honest, right now LearnAPI feels like MLJModelInterface version 2.0.

@ablaom Can you provide a couple simple examples around how state, fitted_params, update!, and ingest! are intended to interact? I currently don’t have a very good mental model for how they would be used in practice.

I find Algorithm too general.
Between Learner, Algorithm and Strategy, I prefer Learner.

Otherwise, what about Method, LearningAlgorithm, LearningStrategy or LearningMethod?

Naming decision

Thanks to all for a lively discussion with lots of participation, which I appreciate enormously. As I count it, we have:

  • strategy 5
  • learner 3
  • algorithm 2 (counting my own vote)

Two people have argued for config/options.

I will revise the documentation and variable names in code to use strategy.

To clarify some of the discussion, implementers can call their strategy structs whatever they like. There is no compulsory supertype for strategy structs. At present, this is the only struct that a new implementation needs to contribute. For this reason, I imagine the strategy struct for ridge regression would just be RidgeRegressor, with no need for a suffix of any kind. (This is how it is in MLJ.)

@kevbonham I can create a link to this naming discussion in the docs, if you like.


I do like the idea of combining strategy, learned parameters, state, and report (other byproducts of training) into a single fit-result object (dare I call this a “model”?), which has been suggested by @CameronBieganek, @jbrea and others several times, with the form of this object being left up to the implementation. I think I have been stuck on the “split” design because it gives maximal flexibility while reducing the burden on new implementations: there is no need to implement separate fitted_params and report methods. By itself, this now feels like a weak argument.

There is another reason for the split output design which concerns serialization. It can happen that for data anonymisation and memory reasons you need to dump state (and report?) before serialising. We might choose to only serialise the learned parameters, but then how do we put the combined FitResult object back together for prediction?

But now I think about it, we can handle this with a serialisable method, which transforms the combined FitResult into something suitable for serialization. Actually we need this anyway (and have it in MLJModelInterface) because not all learned parameters are persistent (in XGBoost.jl the learned parameter is a C pointer) and even some learned parameters have baggage (composite models in MLJ). And of course, 95% of models will be happy with the fallback, which is the identity, so no big burden on new implementations.
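
A minimal sketch of such a conversion method with an identity fallback. All names here (`serializable`, `FitResult`, `HandleResult`, `PersistentResult`) are hypothetical and are not taken from LearnAPI or MLJModelInterface; the `Ref` merely stands in for a non-persistent handle such as a C pointer.

```julia
# Hypothetical fit result combining learned parameters and training state.
struct FitResult{P,S}
    params::P   # learned parameters
    state::S    # training state; may be large or sensitive
end

# Fallback: most fit results can be serialized exactly as they are.
serializable(fitresult) = fitresult

# A fit result whose learned parameters sit behind a non-persistent handle
# (an assumption for illustration; the `Ref` plays the role of a C pointer).
struct HandleResult
    handle::Base.RefValue{Vector{Float64}}
end

# Persistent stand-in that stores a plain copy of the data behind the handle.
struct PersistentResult
    coefficients::Vector{Float64}
end

# Only implementations with non-persistent results need to overload this.
serializable(r::HandleResult) = PersistentResult(copy(r.handle[]))

plain = FitResult([0.5, 1.5], "training state")
serializable(plain) === plain                   # fallback leaves it untouched
serializable(HandleResult(Ref([1.0, 2.0])))     # converted to a persistent form
```

With this shape, an implementation that needs nothing special writes no extra code, which matches the "fallback is the identity" point above.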


@jbrea

  1. I’m not so happy with how verbosity is treated in MLJ and in the LearnAPI proposal. i.e. I prefer the keyword approach of MLJ

The suggestion to make verbosity a kwarg was raised earlier. Here is my earlier response:

I suppose if we extend the notion of “metadata” to include verbosity, this would address
your concern. Conceptually this feels a bit of a stretch. We’d have to worry about
verbosity in every implementation of the optional data interface, which could be
annoying. Again, this feels like something we’re only doing for user-friendliness, but
I’ll consider it further.

I’m still thinking about it :wink:

but I would like to have a simple way to change the default verbosity level. For example a function LearnAPI.default_verbosity(level), and keep the signature fit(alg, X, y; verbosity) to overrule the default.

Sounds like a very good idea. Added this issue.
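
A minimal sketch of that suggestion, assuming a module-level default stored in a `Ref`. The names `DEFAULT_VERBOSITY` and `MeanPredictor`, and the zero-argument getter form of `default_verbosity`, are assumptions made for illustration only; nothing here is existing LearnAPI code.

```julia
# Global default, kept in a `Ref` so it can be changed at runtime.
const DEFAULT_VERBOSITY = Ref(1)

# Getter and setter for the default verbosity level.
default_verbosity() = DEFAULT_VERBOSITY[]
default_verbosity(level::Integer) = (DEFAULT_VERBOSITY[] = level)

# Toy algorithm (hypothetical): predicts the mean of the training target.
struct MeanPredictor end

# The `verbosity` keyword falls back to the current default, and a caller can
# overrule it for a single call.
function fit(alg::MeanPredictor, X, y; verbosity=default_verbosity())
    verbosity > 0 && @info "Fitting $(typeof(alg)) on $(length(y)) observations."
    return sum(y) / length(y)
end

X = rand(3, 2); y = [1.0, 2.0, 3.0]
default_verbosity(0)                     # silence all fits by default
fit(MeanPredictor(), X, y)               # silent
fit(MeanPredictor(), X, y; verbosity=1)  # logs for this call only
```

Because the keyword default is evaluated at call time, changing the global default affects later calls without redefining `fit`.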


Here’s an idea to simplify both the implementation of fit and its use by developers and
end users. It is a variation of a suggestion already made by @CameronBieganek.

We have two fit methods: one to be implemented, raw_fit, which is never called outside of the LearnAPI package, and one for calling by end users and developers, fit, which is never overloaded.

Implementer does something like

raw_fit(strategy::MyStrategy, data...) = ...  # `output`
params(::MyStrategy, output) = ... # `parameters` (just the learned ones)
predict(::MyStrategy, parameters, newdata...) = ...

(Optionally, she can implement a report method.)

In LearnAPI we have a type for wrapping the output of raw_fit which becomes the output
of fit, defined only in LearnAPI:

struct Model{S,O,P}
    strategy::S
    output::O     # cleared out for serialization, ensembling, etc
    params::P
end

function fit(strategy, data...)
    output = raw_fit(strategy, data...)
    # `params` takes the strategy as its first argument, as in the signature above
    params = LearnAPI.params(strategy, output)
    return Model(strategy, output, params)
end

predict(model::Model, newdata...) = predict(model.strategy, model.params, newdata...)

I’m not stuck on Model as the name for this wrapper.

The user or developer extending LearnAPI can do this:

model = fit(strategy, data...)
predict(model, newdata...)

And predict still works if we clear out the full output part of the
model for serialization, ensembling, etc.

minimize(model::Model) = Model(model.strategy, nothing, model.params)

Otherwise, the full output is available for updating methods that need state.

Thoughts?


Thanks, @ablaom, for the clarifications and the new idea! Overall I like it, but I don’t yet understand the need for raw_fit. Wouldn’t it be sufficient to have

# mandatory
model = fit(strategy, data...) # returns struct of type T, e.g. T <: Model
predict(model::T, newdata...)  # implements predict for models of type T
# optional
params(model::T) # returns fitted params; e.g. params(m::Model) = m.params
report(model::T) # returns some report; e.g. report(m::Model) = m.output

and encourage the implementer to use the Model{S, O, P} type to return the fitted model from fit, but if the implementer prefers, it would also be fine to return a custom type MyModel.

What would be the benefit of having two fit functions, raw_fit and fit?

@jbrea Thanks for entertaining my suggestion.

  1. I’m not seeing how your suggestion addresses the problem of wanting predict to work on a minimised version of the fit return value (learned parameters only), which we need for, e.g., deserialised models (model “state” is very large or not properly anonymised in some cases). Won’t we need the implementation to include a second predict method, to dispatch only on the learned parameters, an extra burden?

  2. In my suggestion, there is no need for the implementer to introduce a new type for the return value (she could just return a triple as in the current API) only a new method. And she must introduce a new type to avoid clashes with types that another user might use, right? So, in terms of the burden on the implementer, it’s the same. (Personally, I’d rather implement a method than add a type.)

No. What I imagined was the following

# The implementer writes the following
struct MyStrategy end
function fit(strategy::MyStrategy, X, y)
    output, params = "fitted", X \ y
    return Model(strategy, output, params)
end
predict(model::Model{<:MyStrategy}, X) = X * model.params

# The user does
X = rand(10, 5); y = rand(10)
model = fit(MyStrategy(), X, y)
predict(model, X)
minimal_model = minimize(model)
predict(minimal_model, X) # works because we dispatch only on the first parameter

I don’t think so. The implementer could use another type like MyStrategyModel instead of Model{MyStrategy}, but I don’t see a good reason for doing this and I would discourage it.


The new proposal with raw_fit and Model{S,O,P} still seems too complicated to me. And it’s a leaky abstraction. Suppose I wrote a random forest package, and I want to implement a trees method that acts on the output of fit and returns the ensemble of fitted decision trees. It would have to look something like this:

trees(m::Model{RandomForestRegressor}) = m.params.trees

In other words, the implementation of trees depends on the Model type having a params field. But we don’t usually make struct fields part of APIs, especially APIs that are meant to be very generic and ecosystem-wide. I suppose you could add a params method to extract params from a Model object (ignoring the fact that you proposed the params function for a different purpose), but it all seems more complicated and less generic than necessary. Anyways, as a package developer I don’t want to have to reach through an external type imposed by LearnAPI in order to access the internals of an object that I’ve implemented myself.
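
To make the contrast concrete, here is a small sketch of the two ways a `trees` accessor could be written. Everything in it is hypothetical: the `Model` wrapper copies the shape proposed earlier in the thread, `FittedForest` is an invented package-owned type, and the "trees" are placeholder values.

```julia
# Wrapper type as in the proposal under discussion (not an actual LearnAPI type).
struct Model{S,O,P}
    strategy::S
    output::O
    params::P
end

# Hyperparameter object, left empty for brevity.
struct RandomForestRegressor end

# Variant 1: the accessor reaches through the wrapper's `params` field, so it
# depends on a field name owned by another package.
trees(m::Model{RandomForestRegressor}) = m.params.trees

# Variant 2: the package owns the fitted type, so the accessor only touches
# internals that the package itself controls.
struct FittedForest
    trees::Vector{Any}
end
trees(m::FittedForest) = m.trees

wrapped = Model(RandomForestRegressor(), nothing, (trees = Any[:tree1, :tree2],))
owned = FittedForest(Any[:tree1, :tree2])
trees(wrapped) == trees(owned)   # same result, different coupling
```

Both variants work today; the difference is only in what breaks if the wrapper's layout changes.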

I’m going to write out an interface proposal here, because the whole thing is actually rather short and simple (ignoring traits, target proxies, sci-types, etc).

LearnInterface.jl

module LearnInterface

export fit, predict, minimize

"""
    fit(params, X, y)

Run a learning algorithm on the features `X` and target `y` with the
algorithm hyperparameters specified in `params`. Returns an object
`model` that can make predictions on new features `Xnew` by calling
`predict(model, Xnew)`.
"""
function fit end

"""
    predict(model, X)

Make predictions with `model` on the features `X`. The object
`model` is the output of a call to `fit`.
"""
function predict end

"""
    minimize(model)

Return a minimal version of `model` suitable for serialization.
`minimize` satisfies the following contract:

    predict(model, X) == predict(minimize(model), X)

LearnInterface.jl provides the following default implementation:

    minimize(model) = model
"""
minimize(model) = model

end

So, by default, an implementer of a LearnInterface-compatible model needs to implement two types and two methods:

  • A type for the params object (just the hyperparameters)
  • A type for the model object
  • fit
  • predict

If the implementer chooses to customize serialization, they will need to implement the following:

  • A type for the output of minimize(model)
  • Another method of predict that dispatches on the output of minimize(model)

Note that in this interface there is no need for a params method or a report method.

This seems to me like a very simple and intuitive interface, and it is very generic.
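
As an illustration of the two lists above, here is a sketch of a ridge regression implementation that also customizes serialization. The type names (`RidgeParams`, `RidgeModel`, `MinimalRidgeModel`) and the `residuals` field are assumptions for the example, and the three generic functions are redefined locally as stand-ins so the block runs without the proposed package.

```julia
using LinearAlgebra

# Local stand-ins for the generic functions of the proposed LearnInterface module.
function fit end
function predict end
minimize(model) = model

# Type for the hyperparameters.
struct RidgeParams
    lambda::Float64
end

# Type for the fitted model.
struct RidgeModel
    params::RidgeParams
    coefficients::Vector{Float64}
    residuals::Vector{Float64}   # training byproduct, not needed by `predict`
end

# Type for the output of `minimize`: only what `predict` needs.
struct MinimalRidgeModel
    coefficients::Vector{Float64}
end

function fit(params::RidgeParams, X, y)
    # Solve the regularized normal equations (X'X + lambda*I) * w = X'y.
    coefficients = (X' * X + params.lambda * I) \ (X' * y)
    return RidgeModel(params, coefficients, y - X * coefficients)
end

predict(model::RidgeModel, X) = X * model.coefficients
predict(model::MinimalRidgeModel, X) = X * model.coefficients
minimize(model::RidgeModel) = MinimalRidgeModel(model.coefficients)

X = rand(10, 3); y = rand(10)
model = fit(RidgeParams(0.1), X, y)
predict(model, X) == predict(minimize(model), X)   # the `minimize` contract
```

An implementer who does not care about serialization size would omit `MinimalRidgeModel` and the last two method definitions and rely on the default `minimize`.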


One of the codesigners of MLJ, who is no longer active in the project, was adamant that the learned parameters of every model be inspectable, using a common interface. So (implicitly) I have been viewing this as a requirement of the API.

However, now that I revisit this, I struggle to really understand why this needs to be a hard requirement. As far as I can tell, nowhere in MLJ itself do we use this feature in an essential way. It makes sense for a simple model like a linear one, but generally “learned parameters” take a form that has an implementation-specific interpretation, and raw learned parameters are not that useful, apart from being input to predict.

If we drop accessible learned parameters as a hard requirement, then I agree your proposal looks more sound. I will ponder this some more.


Yeah, I think learned parameters are so idiosyncratic from model to model that it’s not really possible to have a uniform interface for inspecting them.

Taking scikit-learn as an example, the public learned attributes of LinearRegression and RandomForestRegressor (listed in the “Attributes” section) are quite different.


sklearn has a uniform interface to inspect learned params: use the attribute name to know if something comes from learning. In sklearn you can just look for attributes that end with _. For example, a Tfidf vectorizer has a vocabulary_ dict learned from the data, and a cross-validated logistic regression (LogisticRegressionCV) has a learned C_ regularization. Therefore you can just iterate over the attributes of an object and check whether they end with an underscore.


I think I would call that a naming convention rather than an interface. It still comes down to reading the docstring to determine which learned parameters are available for inspection. Packages implementing new models can provide this in a Julian way, like this:

"""
Docs here.
"""
struct RandomForestRegressorParams
    # Hyperparameters here.
end

"""
    RandomForestRegressor

A fitted random forest regressor. Created by calling `fit` with
`RandomForestRegressorParams` and training data.

# Examples

```
julia> params = RandomForestRegressorParams()

julia> model = fit(params, X, y)

julia> hyperparams(model) == params
true
```

# Methods

- `predict(model, X)`: Make predictions on the features `X`.
- `hyperparams(model)`: Return the hyperparameter settings for `model`.
- `trees(model)`: Return the fitted decision trees in the random forest.
- `feature_importances(model)`: Return the impurity-based feature importances.
- `oob_score(model)`: Return the score of the training dataset obtained using
  an out-of-bag estimate.
- `oob_predictions(model)`: Return out-of-bag predictions.
"""
struct RandomForestRegressor
    # Fields.
end

As a side note, looking at this example makes me think that hyperparams should be in LearnAPI as a generic method to recover the hyperparameter object from a fitted model.


My thinking, already mentioned in the proposal, is that all methods implemented get listed in the LearnAPI.functions trait, so we can be more programmatic about this. We could have a published API for common accessor functions like feature_importances, and even provide tests that the API is correct.
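
A rough sketch of how such a trait might be used programmatically. Only the idea of a `functions` trait comes from the proposal; the fallback returning an empty tuple, the use of symbols, and the toy types `ToyStrategy` and `ToyModel` are assumptions made for this example.

```julia
# A published accessor function that implementations may choose to overload.
function feature_importances end

# Trait listing the functions an implementation provides; fallback: none.
functions(::Any) = ()

# Toy implementation (hypothetical): "importance" is the column-wise sum of
# absolute feature values, chosen only to have something to compute.
struct ToyStrategy end
struct ToyModel
    importances::Vector{Float64}
end

fit(::ToyStrategy, X, y) = ToyModel(vec(sum(abs, X; dims=1)))
feature_importances(model::ToyModel) = model.importances
functions(::ToyStrategy) = (:fit, :feature_importances)

# Generic code can query the trait instead of reading a docstring.
strategy = ToyStrategy()
model = fit(strategy, [1.0 -2.0; 3.0 4.0], [0.0, 1.0])
importances = if :feature_importances in functions(strategy)
    feature_importances(model)
else
    nothing
end
```

A test suite could iterate over `functions(strategy)` in the same way and check that each declared function actually has a method for the fitted object.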

As a side note, looking at this example makes me think that hyperparams should be in LearnAPI as a generic method to recover the hyperparameter object from a fitted model.

Yes, but perhaps you could agree to call this strategy now, as this naming issue is now closed, at least as far as I am concerned :wink: .

Ah, shoot. I had thought that we sidestepped the naming issue since we clarified that input and output types can be named anything, and no LearnAPI type needs to be subtyped. So all that was really under discussion was the language used in the LearnAPI documentation. But if we have a hyperparameter/options/strategy/algorithm extraction function, then it starts to matter again (but only a little). :joy:

I can understand the desire for a term that is shorter, colorful, or more evocative, but when it comes down to it, the term that is widely used in the machine learning community is “hyperparameters”. The words “options” and “strategy” are not really used. So the clearest and most natural way to refer to an object that holds hyperparameters is “the hyperparameter object”, or just “the hyperparameters”. If we use the terminology “strategy”, “options”, “learner”, or “algorithm” to refer to an object whose sole purpose is to hold hyperparameters, I think we are likely to cause confusion. If we refer to that object as “the hyperparameter object”, it’s crystal clear what we mean.
