[ANN] LearnAPI.jl - Proposal for a basement-level machine learning API

If you look at the source code of check_is_fitted, the string names of model attributes are used to determine whether a model is fitted. Technically, this is not an interface, but one can see that the naming convention is key to telling whether a new model added by developers is fitted or not (hence the way to tell whether a model is fitted is indirectly defined by the naming convention for learned parameters). One can get the learned parameters of any model or transformer with get_learned_parameters = lambda x: [attr for attr in vars(x) if attr.endswith('_')]

I think it is actually kind of clean: model objects do not need to store an iterable telling users what the ‘learned parameters’ are (such as learned_params = ['bias_', 'weights_']); they can be inferred from the attribute names. As long as new code follows the naming convention, check_is_fitted will work, and the same get_learned_parameters function will work.

Such naming conventions are super useful for users. When using the package, if you see a variable that starts with n_, such as n_neighbors, you already know it’s meant to be an integer.

Coming back to this after a while.

I am still keen on something close to @CameronBieganek’s suggestion. I would appreciate some help addressing one fly in the ointment that I see in this and other approaches. I feel quite stuck on this point.

The problem concerns algorithms that do not generalize to new data. A good illustration of the issue is DBSCAN clustering (and many other clustering algorithms): the algorithm constructs some clusters for a data set X, which it can then use to label observations in X, but you cannot use the method to assign labels to some new data X_production. Such algorithms fit into the proposed scheme easily enough: fit consumes no data and returns nothing; predict consumes the data, performs the clustering, and outputs the labels. The problem is providing an interface point for byproducts of the clustering process. In the DBSCAN example these might be additional tags, “boundary”, “centre” or “noise”, for the observations. These are not byproducts of training because there isn’t any training. Only predict even gets to see the data, but predict would generally be expected to output only “predictions” (labels) and nothing else.
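To make the issue concrete, here is a toy sketch (all names hypothetical, not part of any proposal) of where the byproducts get lost:

struct ToyDBSCAN
    eps::Float64
    min_points::Int
end

# Trivial stand-in for a real clustering routine, just so the sketch runs;
# returns (labels, per-observation tags).
_cluster(algorithm::ToyDBSCAN, X) = (fill(1, length(X)), fill(:noise, length(X)))

# `fit` sees no data and learns nothing:
fit(algorithm::ToyDBSCAN) = algorithm

# `predict` does all the work, but is expected to return only the labels,
# so the tags computed along the way have nowhere to go.
function predict(algorithm::ToyDBSCAN, X)
    labels, tags = _cluster(algorithm, X)
    return labels   # `tags` is silently dropped
end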

This issue caused a lot of gymnastics in MLJ. In the end, MLJModelInterface.predict and MLJModelInterface.transform have been allowed to return two items, (output, report), but we had to add a trait to flag the exceptional behaviour to avoid breakage, and there was a lot of understandable grumbling. (Part of the constraint was that models of this kind had to look enough like other models to fit easily into the existing model composition API.)

Thoughts anyone?

By the way, in asking this question, I am still keen to avoid mutable structs.

I’m not sure this helps, but my first impression is that fit would compute the mapping to clusters and the other information, and predict would be the function that does nothing but return the labels from the information stored by fit.


Yeah, I think it makes sense for the clustering to happen in fit. I looked at scikit-learn, and that appears to be what they do. In scikit-learn, predict methods are only provided for the clusterers that can label new points, so KMeans has a predict method and DBSCAN does not have a predict method.

Also, in the discussion above I was hung up on needing to have two separate types for each model, so that the fit signature would look like this:

fit(m::MyModelParams, X, y) :: MyModel

However, it has finally occurred to me that we can just use one immutable struct where, for an untrained model, the fields that represent fitted parameters are set to nothing (either by making those fields parametric or by making them Union{T, Nothing}). For example, the code for RandomForestRegressor could look like this:

struct RandomForestRegressor
    n_trees::Int
    min_samples_leaf::Int
    trees::Union{Vector{DecisionTree}, Nothing}
    oob_score::Union{Float64, Nothing}
end

function LearnAPI.fit(m::RandomForestRegressor, X, y)
    # ...
    # trees = ...
    # oob_score = ...
    RandomForestRegressor(
        m.n_trees,
        m.min_samples_leaf,
        trees,
        oob_score
    )
end
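
A possible usage sketch (toy data; assumes the elided body above is filled in):

X = rand(100, 3)
y = rand(100)

unfitted = RandomForestRegressor(100, 5, nothing, nothing)
fitted = LearnAPI.fit(unfitted, X, y)

isnothing(unfitted.trees)  # true: the original instance is untouched
isnothing(fitted.trees)    # false: `fit` returned a new instance with the learned fields filled in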

Here’s some more code that demonstrates what the clustering models could look like:

module LearnAPI
    function fit end
    function predict end
    function cluster_labels end
end

struct DBSCAN
    eps::Float64
    min_samples::Int
    centers::Union{Vector{Vector{Float64}}, Nothing}
    labels::Union{Vector{Int}, Nothing}
end

function LearnAPI.fit(m::DBSCAN, X)
    # ...
    # centers = ...
    # labels = ...
    DBSCAN(m.eps, m.min_samples, centers, labels)
end

# No predict method for DBSCAN.

function LearnAPI.cluster_labels(m::DBSCAN)
    if isnothing(m.labels)
        throw(ArgumentError("DBSCAN model not fit yet."))
    end
    m.labels
end

# More specific extraction functions are probably not
# defined in LearnAPI.
function cluster_centers(m::DBSCAN)
    if isnothing(m.centers)
        throw(ArgumentError("DBSCAN model not fit yet."))
    end
    m.centers
end

struct KMeans
    n_clusters::Int
    centers::Union{Vector{Vector{Float64}}, Nothing}
    labels::Union{Vector{Int}, Nothing}
end

function LearnAPI.fit(m::KMeans, X)
    # ...
    # centers = ...
    # labels = ...
    KMeans(m.n_clusters, centers, labels)
end

function LearnAPI.predict(m::KMeans, Xnew)
    # Return cluster labels for new data.
end

function LearnAPI.cluster_labels(m::KMeans)
    if isnothing(m.labels)
        throw(ArgumentError("KMeans model not fit yet."))
    end
    m.labels
end

# More specific extraction functions are probably not
# defined in LearnAPI.
function cluster_centers(m::KMeans)
    if isnothing(m.centers)
        throw(ArgumentError("KMeans model not fit yet."))
    end
    m.centers
end
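
A possible usage sketch for the code above (toy data; assumes the elided bodies are filled in):

X = [rand(2) for _ in 1:50]

dbscan = LearnAPI.fit(DBSCAN(0.5, 5, nothing, nothing), X)
LearnAPI.cluster_labels(dbscan)   # labels for the training data only
cluster_centers(dbscan)           # model-specific extraction function

kmeans = LearnAPI.fit(KMeans(3, nothing, nothing), X)
Xnew = [rand(2) for _ in 1:10]
LearnAPI.predict(kmeans, Xnew)    # KMeans *can* label new data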
fit(m::MyModelParams, X, y) :: MyModel

Would the lowercase y imply a vector, or could it also be a multivariate regression?

Well, it’s just a variable name. What matters is what is in the docstring for fit. I don’t think we discussed it above, but that argument could probably be allowed to be multivariate.

I’m starting to wonder if trying to make pipelines and models immutable is swimming against the current. Maybe it’s good enough to just make the hyperparameter fields in a model struct const and let the fitted parameter fields be mutable, like this:

mutable struct RandomForestRegressor
    const n_trees::Int
    const min_samples_leaf::Int
    trees::Union{Vector{DecisionTree}, Nothing}
    oob_score::Union{Float64, Nothing}
end

Then the interface usage would look like this,

model = RandomForestRegressor()
fit!(model, X, y)
ŷ = predict(model, Xnew)

with models optionally providing extraction functions to extract information from fitted models, like this:

# Get the out-of-bag score for a fitted RandomForestRegressor. 
oob_score(model)

Of course, const fields require at least Julia v1.8.

The LearnAPI documentation could contain guidelines that strongly encourage model implementers to make hyperparameter fields const.
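
For completeness, a hedged sketch of what fit! might look like under this mutable scheme (body elided as in the earlier examples):

function fit!(m::RandomForestRegressor, X, y)
    # ...
    # m.trees = ...       # allowed: `trees` is not `const`
    # m.oob_score = ...   # allowed: `oob_score` is not `const`
    # m.n_trees = 200     # would error: `n_trees` is `const`
    return m
end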

What would be the advantages of a mutable interface here?
Technically, I would argue that fit changes the type, i.e., from a mere model specification including only the hyper-parameters to an actually fitted model including the data-dependent parameters. This is also what some R libraries such as tidymodels do, and further, one can simply use dispatch to decide whether predict can be used or not – maybe with a default implementation for model specifications as well:

predict(m::AbstractModelSpec, Xnew) = error("ModelSpec cannot be used for prediction. You probably want to `fit` the model first!")
# If needed, one can also define
isfitted(m::AbstractModelSpec) = false
isfitted(m::AbstractModelFit) = true
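
For concreteness, a minimal sketch of the two-type idea (ridge regression chosen arbitrarily; all names are illustrative):

using LinearAlgebra

abstract type AbstractModelSpec end
abstract type AbstractModelFit end

struct RidgeSpec <: AbstractModelSpec
    lambda::Float64
end

struct RidgeFit <: AbstractModelFit
    spec::RidgeSpec
    coefs::Vector{Float64}
end

# `fit` changes the type: a spec goes in, a fitted model comes out.
fit(spec::RidgeSpec, X, y) = RidgeFit(spec, (X'X + spec.lambda * I) \ (X'y))

predict(m::RidgeFit, Xnew) = Xnew * m.coefs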

Am I missing anything here? I.e., does a mutable interface have any other benefits besides looking like standard OOP?

I think it is natural to think of a machine learning model like a random forest as a bag of hyperparameters and fitted parameters. The hyperparameters and fitted parameters together are “the model”. Also, combining the hyperparameters and the fitted parameters into one type reduces the number of types exposed to the user by a factor of 2. And this can be done either with immutable model types or mutable model types. (I have example code a few posts above for hyperparameters and fitted parameters combined into a single immutable struct.)

Also, if we wanted to 100% follow the Spec -> Fit philosophy that you are recommending, we would also need to apply it to transformers. So we would need StandardizerSpec plus StandardizerFit, and FillImputerSpec plus FillImputerFit.

(I would probably drop the suffix Fit from the fitted types.)

Mutability is more natural in Julia than it is in R. In fact, most R data structures are immutable. Even x[2] <- 100 is actually syntax sugar for a pure function call and a rebinding of the variable x. This is the desugared version of x[2] <- 100:

x <- `[<-`(x, 2, 100)

Thanks, was just curious and somehow like immutable APIs. You are right though that “mutability is more natural in Julia than it is in R” or some other languages.


You can easily make a “mutable” immutable API with BangBang.jl, which seems like the easiest approach to me.
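
If I recall correctly, BangBang provides setproperty!! and friends, which mutate when they can and otherwise return an updated copy; something like:

using BangBang

struct Point
    x::Float64
    y::Float64
end

p = Point(1.0, 2.0)
p2 = setproperty!!(p, :x, 10.0)   # returns a new Point, since Point is immutable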

I read through this thread again, which convinced me that the non-mutating interface that I proposed here would be the best and most generic interface. In action, that interface looks like this:

model = # ... Create an object with the model hyperparameters.
modelfit = fit(model, X, y)
ŷ = predict(modelfit, Xnew)

The interface says nothing about the types of the objects model and modelfit. All that matters is that the output of fit can be used by predict. So, there are at least three approaches that interface implementers can use:

  1. Make model and modelfit instances of separate types, like MyModel and MyModelFit, respectively.
  2. Make model and modelfit separate instances of the same type, e.g. MyModel. This can be done even if MyModel is immutable, as I described in a previous post.
  3. Have the output type be a wrapped version of the input type. For example, the input type could be MyModel and the output type could be FitResult{MyModel} (a sketch of this option follows below).
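
For what it’s worth, a minimal sketch of option 3 (the names FitResult and MyModel are just placeholders):

struct MyModel
    # hyperparameters ...
end

struct FitResult{M}
    model::M            # the hyperparameter object that was passed to `fit`
    fitted_params::Any  # whatever `fit` learned
end

function LearnAPI.fit(model::MyModel, X, y)
    # fitted_params = ...
    FitResult(model, fitted_params)
end

function LearnAPI.predict(modelfit::FitResult{MyModel}, Xnew)
    # Use modelfit.model and modelfit.fitted_params to make predictions on Xnew.
end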

We could have some style guidelines in the documentation for LearnAPI. Since users define the model object and only interact with the modelfit object via predict (and potentially extraction functions like trees and oob_score that are defined for the specific model), I would recommend the following style guideline:

  • Use plain model names (without a suffix) for the input type which contains only hyperparameters.

So defining a pipeline might look like this:

model = Pipeline(
    Standardizer(),
    Imputer(),
    RandomForestRegressor()
)

Personally I might choose to implement option 2 above for preprocessing transformers and option 1 for ML models, with the suffix of Fit for the output type, e.g. RandomForestRegressorFit.


Looks great @CameronBieganek, love this flexibility.

My 2 cents: I’d prefer modelfit, report = fit(model, X, y).

The report is some container of model metrics/diagnostics. I think it’s cleaner that modelfit contains whatever is required for predict and no more. In practical terms, I’m working on a simulation with hundreds of models, each with a sizable suite of diagnostics. The simulation only needs modelfit for calling predict. Bundling the report with modelfit would muddy the waters and feel a little bloated.

The original proposal that I made above includes a minimize method for extracting only the parts of a model that are needed for prediction. (The name is up for debate.) That’s definitely useful when persisting a model to disk. Would that satisfy your requirements?
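
For concreteness, a hedged sketch of minimize for the immutable RandomForestRegressor from earlier in the thread (assuming LearnAPI also declares `function minimize end`):

function LearnAPI.minimize(modelfit::RandomForestRegressor)
    # Keep only what `predict` needs; drop diagnostics such as the OOB score.
    RandomForestRegressor(
        modelfit.n_trees,
        modelfit.min_samples_leaf,
        modelfit.trees,
        nothing
    )
end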

I think it is conceptually cleaner to encapsulate all the information about a model fit into one object. Then the model implementer can provide a docstring for the fitted model type, like this:

"""
    RandomForestRegressorFit

A fitted random forest regressor that can be used with `predict`
to predict target values on new input data. Other methods include:

- `trees`: Extract the ensemble of decision trees.
- `oob_score`: Extract the out-of-bag score.
- `oob_predictions`: Extract the out-of-bag predictions.
"""

Of course, the model implementer can choose to make the fields of the fitted model type part of the public API instead of providing extraction functions like trees(modelfit) and oob_scores(modelfit).

This is a bit subjective, but when designing generic APIs, I prefer to keep the concepts as sharp as possible. Objects like report or cache seem too vague and opaque to me. What can I do with a report? I know what I can do with a modelfit: I can use it with predict to predict new values. But there are no methods for a report object, so I don’t know if it quacks like a duck or meows like a cat.

Such a method should probably also take some kwargs, or maybe dispatch on the intended prediction domain, since the model may be configured differently to optimize for batch prediction vs. single-row prediction.


Another case where minimize() would be useful: using an RF regression for regular predictions, where the feature importances are informational but not required for prediction, as opposed to using an RF model where the importances are an input to another model in the pipeline (to select features, or some other logic).

That’s my point. We agree that the model object has a clear purpose, namely to be fed to the predict method. The purpose of a report object is less clear: it could be fed to a pipeline for model selection, or, in my case, to a visualization tool to compare against other models for the same response variable, as well as serve as a way for colleagues to audit the selected model.

In the interest of keeping concepts as sharp as possible, the content required for prediction should be kept separate from everything else. I find the minimize method, in addition to a second predict method, less clear than simply returning fitted, report from the fit method, with fitted then fed to the one and only predict method.