[ANN] LearnAPI.jl - Proposal for a basement-level machine learning API

Feedback is being sought on a new, detailed proposal, called LearnAPI.jl, for a general purpose low-level machine learning interface. The proposal is based on our three years’ experience developing and maintaining the machine learning toolbox MLJ.jl. Ultimately, it may replace the current foundation of the MLJ ecosystem, but:

  1. Is less tied to specific requirements of MLJ

  2. Is extensively and more clearly documented (with a User Guide also planned)

  3. Is easier for existing model implementations to adopt (e.g., no abstract model types to subtype)

  4. Is general enough to accommodate the large variety of machine learning models already integrated with MLJ, and more

  5. Is agnostic to the form of data consumed by models. Using an optional data interface, each model articulates how user-provided data is coerced to model-specific representations (if these are different) and how observations within the data can be accessed (to allow for external control in resampling, for example).

To be clear, this proposal concerns a low-level API, not a full-blown ML toolbox.

For details, visit the docs.

Feedback welcome here, on the #mlj Julia slack channel, or, to provide private feedback, direct message me on Julia slack. My handle is @ablaom.

43 Likes

Thanks for the initiative! I was really looking forward to this!

5 Likes

Thank you for creating this proposal! I haven’t read the entire docs yet, but here is my initial reaction.

  • I’m not a fan of passing around nothing in 90% of cases because you don’t need one of the inputs or outputs that the API specifies, e.g. state = nothing for a model fit or report = nothing for a model prediction.
  • I would much prefer verbosity to be a keyword argument, rather than a positional argument that occurs before X and y.

Maybe it is intended that this interface is not the interface that the end user will see. Maybe it is intended that a higher-level ML API will be built on top of this interface. But that doesn’t feel right to me, because LearnAPI is claiming some very fundamental names like fit and predict. I would prefer that this API be usable by end users without needing to use a higher-level library. As such, fit and predict should be more ergonomic, in my opinion. For example,

fitted_params, state, report = fit(model, verbosity, X, y)

is not very ergonomic. Neither is the following ergonomic:

yhat, report = predict(model, fitted_params, Xnew)

Now take a look at this function call from the ridge regression example in the docs:

feature_importances(model, fitted_params, fit_report)

In this example, only one of the arguments (fit_report) is actually used by the function. That seems clunky and wasteful.

A model is just a container for hyper-parameters.

This seems like a misnomer to me. Surely a model consists of both the hyper-parameters and the learned parameters, and whatever other auxiliary information seems appropriate…

Here’s the API that I would like to see. The hyper-parameter types that are currently called “models” would be renamed to *Options. Then fit would return a single output—the actual model type that includes all the relevant information: hyper-parameters, learned parameters, feature importances, metadata, etc. So for example, a call to fit would look like this:

model = fit(options, X, y)

Predict would look like this:

yhat = predict(model, Xnew)

And feature_importances would look like this:

importances = feature_importances(model)

As a concrete example, I’ve re-written the ridge regression example the way that I would like to see it:

My version of LearnAPI
using LearnAPI
using Tables
using LinearAlgebra: I   # identity, needed for the ridge penalty below

struct RidgeRegressorOptions
    lambda::Float64
end

RidgeRegressorOptions(; lambda=0.1) = RidgeRegressorOptions(lambda)

struct RidgeRegressor
    options::RidgeRegressorOptions
    coefs::Vector{Float64}
    importances::Vector{Pair{Symbol, Float64}}
end

function LearnAPI.fit(options::RidgeRegressorOptions, X, y; verbosity=0)
    x = Tables.matrix(X)
    s = Tables.schema(X)
    features = s.names

    coefs = (x'x + options.lambda*I)\(x'y)

    importances = [features[j] => abs(coefs[j]) for j in eachindex(features)]
    sort!(importances, by=last, rev=true)

    verbosity > 0 && @info "Features in order of importance: $(first.(importances))"

    RidgeRegressor(options, coefs, importances)
end

LearnAPI.predict(model::RidgeRegressor, Xnew) = Tables.matrix(Xnew) * model.coefs
LearnAPI.feature_importances(model::RidgeRegressor) = model.importances

It would be nice to have concrete fitted model types like RidgeRegressor that can have their own docstrings that document how to introspect the learned parameters. For example, a RandomForestRegressor could have a docstring that explains how the user can examine the individual trees within the forest. Splitting a model into separate fitted_params, state, and report objects (plus the original hyper-parameter object) seems cumbersome and harder to document.

17 Likes

I’m quite aligned with the general considerations brought by @CameronBieganek.

However, I’d tend to favor the view that a model should be limited to the actual learned parameters, along with the “recipe” that converts those params into a prediction.

My variation of the ideal API would be:

struct RidgeRegressorOptions
    lambda::Float64
end

struct RidgeRegressor
    coefs::Vector{Float64}
    importances::Vector{Pair{Symbol, Float64}}
end

function LearnAPI.fit(options::RidgeRegressorOptions, X, y; verbosity=0)
    ...
    return (model, fitresults) # fitresults and/or cache
end

function LearnAPI.fit!(model::RidgeRegressor, options::RidgeRegressorOptions, X, y; verbosity=0)
    ...
end

LearnAPI.predict(model::RidgeRegressor, Xnew) = Tables.matrix(Xnew) * model.coefs

The reason for that difference is that for models such as boosted trees or neural nets, I don’t see any reason why hyper-params such as the lambda/L2, or the optimiser for a NN, couldn’t change throughout the course of the model fitting. Therefore, tying the last snapshot of the hyper-params to the model doesn’t seem adequate.

Also, when it comes to model deployment, it’s desirable to have as lean a model as possible. So for a linear model, I just want coefs and bias, not the various options, which can potentially include objects like a random number generator.

I think Flux’s definition of a model fits the bill well here: model = Chain(Dense(2, 3, relu), ...).
Such a model is just the learnable params along with the recipe for how to use them. And it comes with the elegant model(x) for inference.

A question mark is that it’s often convenient to generate some cache or scratch data when initializing a model. Should a fit! be forced to have both model and cache arguments, or could it support both situations? I don’t see a clear path to accommodate all scenarios.
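
To make the question concrete, here is one possible pattern (a sketch only, with a made-up toy model, not a proposal): fit! accepts an optional cache keyword and creates the scratch data itself on the first call, so a single verb supports both situations.

# Sketch only: one way a single fit! could support calls with and without a
# cache. ToyModel and make_cache are invented for illustration.
mutable struct ToyModel
    coef::Float64
end

make_cache(model::ToyModel, X, y) = similar(y)    # scratch buffer for residuals

function fit!(model::ToyModel, X, y; cache=nothing)
    cache === nothing && (cache = make_cache(model, X, y))
    cache .= y .- model.coef .* X                 # reuse scratch space on every call
    model.coef += sum(cache) / length(cache)      # toy parameter update
    return model, cache                           # hand the cache back for reuse
end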

I’d also stress the importance for the API to allow for performance-first options where one needs them. For example, not forcing the computation of feature_importances if not needed. Perhaps more critical is for iterative models such as GBT to be able to efficiently track metrics on an eval dataset. Under the current MLJ design, such metrics must be computed from scratch each time the evaluation is desired, which can significantly hurt performance, since inference from the full N trees must be computed each time instead of just the new trees since the last eval. I’m not clear how best to support such tracking; perhaps optional deval / x_eval / y_eval kwargs to fit could do it?

Maybe related to the above, but I’m not sure I grasp the difference in scope of update! and ingest! mentioned in LearnAPI. I’m wondering if a single fit! could be both sufficient and lighten the API verbs.

Finally, although this is more of a distraction over terminology tastes: regarding Options in RidgeRegressorOptions, I’m wondering if something along the lines of HyperParams / Hyper would be closer to its meaning?

Thanks @ablaom for bringing this discussion and all your work on this important structuring project!

6 Likes

Very nice comments and suggestions. I will try to read the proposal carefully over the week to provide additional feedback as well.

One thing that I really dislike about the current state of affairs is the inconsistency with output types. Models which are probabilistic may output distributions, categories, integers, etc. and that is a pain to post-process. Moreover, developers of models are lost with so much flexibility and end up choosing whatever they feel is more natural. As a result, end-users struggle to write generic scripts that accept any kind of model. I think we should try to reduce the noise with output types as much as possible, and make sure that API functions like predict have a well-known (and fixed) output scientific type. We can then add extra API functions like predictprob for models that support variations of the output.

I also remember having lots of issues with the unsupervised learning API. I really think we should stop treating clustering + dimension reduction + whatever pre-processing as this generic bag of “unsupervised” models. It doesn’t help when an API is designed for such a broad set of concepts with different use cases.

For pre-processing, I would like to shamelessly promote the TableTransforms.jl package as the way forward. It has tons of transforms already implemented in Julia with convenient syntax that works with general Tables.jl. It provides a simple apply API function to convert the features to whatever format needed in a pipeline with parallel branches, etc. My goal is to plug the LearnAPI.jl low-level functions at the end of pipelines to construct something more reusable and feature-rich.

Also, please avoid macros if they are still present in LearnAPI.jl. I didn’t find them in the docs, but would like to just point this out before it is too late. And finally, try to remove the Pkg manager from the end-user experience. MLJ.jl is currently very hard to teach to beginners because whenever they need to try a simple model on a simple table, they are prompted with tons of dependency requests related to the @load macro. Users should be able to install a single package with all necessary dependencies, curated and maintained by the same group of people instead of tiny packages that just wrap other third-party packages and implement the API.

Ideal future scenario:

  1. Beginner learns about Julia and is interested in ML
  2. Finds out this central package with curated list of deps
  3. Installs this package with Pkg.add("MLStack")
  4. fit/predict works without additional fuss with the Pkg manager
7 Likes

Really great initiative. I’m pleased to see the rather different nature of probabilistic models considered. Outputting predictions from sample-based posteriors is quite different from outputting a point estimate.

How wide is the scope of this API? Is it primarily statistical models or would it also include other types like symbolic regression etc?

1 Like

Thanks @ablaom for reflecting on this.

I also agree with @CameronBieganek. In the search for something lighter and simpler than MLJ I’ve found Invenia’s Models.jl quite useful, which nicely separates untrained models (templates) from trained models. Some of the ideas there may be helpful in this endeavour?

Also, a project that I’m involved in has models being written to disk/database after training, then read into memory in several separate processes (in parallel) to be used (predicted) in long-running simulations. That is, each model is trained in 1 process and subsequently used in several parallel processes. Ideally the serialisation format would use something not specific to Julia, such as JSON or something similar. You’ve probably got this use case covered, but worth mentioning anyway.

4 Likes

For cases like that, are there hyper-hyper-parameters that are constant and can be stored with the model? Like the specification of a learning rate schedule? Something like this, maybe:

schedule = ReduceLearningRateOnPlateau(args...)

It would probably be worth adding a minify method to the API to take a full model object and return only the parts necessary for making predictions. Here’s the ridge regressor example with minify added:

Ridge regressor with minify method
using LearnAPI
using Tables

struct RidgeRegressorOptions
    lambda::Float64
end

RidgeRegressorOptions(; lambda=0.1) = RidgeRegressorOptions(lambda)

abstract type AbstractRidgeRegressor end

struct RidgeRegressor <: AbstractRidgeRegressor
    options::RidgeRegressorOptions
    coefs::Vector{Float64}
    feature_importances::Vector{Pair{Symbol, Float64}}
end

struct RidgeRegressorPredictor <: AbstractRidgeRegressor
    coefs::Vector{Float64}
end

coefs(model::RidgeRegressor) = model.coefs
coefs(model::RidgeRegressorPredictor) = model.coefs

function minify(model::AbstractRidgeRegressor)
    RidgeRegressorPredictor(coefs(model))
end

function LearnAPI.predict(model::AbstractRidgeRegressor, Xnew)
    Tables.matrix(Xnew) * coefs(model)
end

Hyper is not too bad. I’ve considered Params in the past, but it seems too generic of a word and I don’t like the abbreviation. HyperParams or HyperParameters leads to type names that are rather long. I still prefer Options. In fact, I stole the idea for Options from a Rust random forest package.

I do think it would be nice to have a curated set of models like scikit-learn. It’s not ideal for new users to have to choose between three different implementations of the same model.


I generally prefer a functional, immutable style whenever possible, but it might be worth considering an alternative design that relies on mutability:

  • All the information related to a model is stored in one mutable struct, e.g. RidgeRegressor.
  • The two primary methods are fit! and predict (there is no fit method).

This is analogous to how scikit-learn works. (Not saying that’s a good thing—just a point of reference.) The ridge regressor struct would look something like this:

mutable struct RidgeRegressor
    lambda::Float64
    coefs::Union{Nothing, Vector{Float64}}
    feature_importances::Union{Nothing, Vector{Pair{Symbol, Float64}}}
end

function RidgeRegressor(; lambda=0.1)
    RidgeRegressor(lambda, nothing, nothing)
end

One of the main advantages of this approach is that there is only one type for each model—there’s no dichotomy between RidgeRegressorOptions and RidgeRegressor.

@CameronBieganek @juliohm @jeremiedb @jocklawrie @DoktorMike

Thanks for the feedback. I respect your expert advice - it is very much appreciated. I’m going to reflect on your suggestions and post something after the weekend.

6 Likes

I also prefer the functional immutable style. It is much more natural in Julia and has tons of benefits when it comes to pipelines that store the minimum amount of information needed to reproduce the work (the RidgeRegressorOptions).

I believe that mixing up mutability with immutability will lead to serious design problems (learn from Flux.jl). Saving fields of structs as undef states to be mutated later is cumbersome and will probably introduce design challenges down the line.

4 Likes

Coming back to this proposal after a second look into the docs…

My suggestion is to start simple with the core required set of API functions that all supervised models must implement. If this is fit and predict, then we make sure that they have signatures that are simple and Julian with keyword arguments.

After that, we need to add a list of trait functions to the API to test if models implement “optional” functions. For example, isprobabilistic(model) could be used before running a new function predictprob for probabilistic predictions.
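
To illustrate the kind of generic code this enables (a rough sketch; isprobabilistic and predictprob are just the illustrative names used above, not existing LearnAPI functions):

# Sketch: trait-guarded generic user code. The names isprobabilistic and
# predictprob are illustrative only.
function predict end
function predictprob end

isprobabilistic(model) = false    # conservative default for models without the trait

function point_and_prob(model, fitted_params, Xnew)
    yhat  = predict(model, fitted_params, Xnew)
    probs = isprobabilistic(model) ? predictprob(model, fitted_params, Xnew) : nothing
    return yhat, probs
end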

Given a well-designed set of trait functions for supervised models, I then suggest stepping back for a while to discuss the unsupervised cases such as “clustering”.

I feel that the LearnAPI.jl documentation (like the MLJ.jl documentation) is overly complicated. If we can keep it simple and to the point, we can evolve things more quickly as a community. It also starts with a statement that I disagree with:

Machine learning algorithms, also called models , have a complicated taxonomy. Grouping models, or modeling tasks, into a relatively small number of types, such as “classifier” and “clusterer”, and attempting to impose uniform behavior within each group, is challenging.

Machine learning algorithms are not the same concept as machine learning models. A learning algorithm is used to learn parameters of a learning model (e.g. maximum likelihood estimation can be used to learn coefficients of a linear model). These models fit in well-known categories with well-defined behavior and that is why it makes sense to design an API.

@ablaom in the case of supervised models, are there other required API functions other than fit and predict? Should we start by listing all (pure) Julia packages implementing supervised models?

I suggest that we create a wrapper package that adds existing implementations as dependencies and implements the LearnAPI.jl interface for them. This will get things going and will allow other contributions from the community. Also, should we concentrate these efforts in a single organization? JuliaML already existed when you started JuliaAI for the MLJ.jl efforts. Maybe LearnAPI.jl should be moved to JuliaML where other community members are already working? Same for ScientificTypes.jl?

2 Likes

Again, thanks to all for your comments.

Maybe it is intended that this interface is not the interface that the end user will
see. Maybe it is intended that a higher-level ML API will be built on top of this
interface.

Yes, @CameronBieganek, this is indeed the intended purpose.

There are three key stakeholders in the design of a low-level machine learning API:

  • developer, which here means someone adding model-generic functionality, such as
    hyperparameter optimization, model search, iterative model control, etc.

  • implementer, someone implementing the interface for some existing model, because
    they want to buy into the functionality made available by some developer. The existing
    model will have a “native” interface likely designed independently of the low level API

  • user, someone who interacts directly with a model, and not through some high level
    interface.

The existing design prioritizes the concerns of the developer and the implementer. The
result, I completely agree, is an API falling short of the requirements for direct user
interaction. The suggested API of @CameronBieganek is undoubtedly superior for this. But
from the point of view of the implementer/developer it is suboptimal because it requires
new methods (a method to extract learned parameters from the
options-fitted-parameter-report conglomerate, possibly another to extract the report) in
addition to extra gymnastics to ensure predict dispatches on the conglomerate and the
stripped down learned parameters.

The idea of making fit! mutating has been objected to by @juliohm, and I would also
prefer to avoid this.

It seems to me that adoption of the API depends critically on implementation being as
simple as possible (a minimum of compulsory methods) and as unobtrusive as possible (a
minimum of new structs, no abstract types to subtype). For comparison, Tables.jl is a
low-level API with wide adoption but it is not particularly user-friendly (think of
extracting column names).

But that doesn’t feel right to me, because LearnAPI is claiming some very fundamental
names like fit and predict.

Perhaps it was not clear, but these names are not exported by LearnAPI and not intended to
be exported by any package, or overloaded, except by model implementations.

@CameronBieganek, @jeremiedb I wonder if the API becomes more palatable if LearnAPI.jl is
bundled with a lightweight “user interface”, UserAPI?

I’m not a fan of passing around nothing in 90% of cases because you don’t need one of
the inputs or outputs that the API specifies, e.g. state = nothing for a model fit or
report = nothing for a model prediction.

I agree that a method returning nothing in most cases feels clunky, but it avoids
complicating case distinctions for the developer. One of the key pain points in MLJ
development was how to accommodate a model like DBSCAN clustering, which does not
generalize to new data (has no learned parameters) but nevertheless has output, separate
from the transformation itself (the labels), that you would like exposed to the user. To
that end, we introduced the possibility that transform (or predict) can output a
report item, but to make this non-breaking we introduced a trait to flag the fact that
transform’s output was in two parts. In the end we wound up breaking developer code we
didn’t know about, and there was understandable objection to such complicated behavior in a low-level method.

I would much prefer verbosity to be a keyword argument, rather than a positional argument that occurs before X and y.

I suppose if we extend the notion of “metadata” to include verbosity, this would address
your concern. Conceptually this feels a bit of a stretch. We’d have to worry about
verbosity in every implementation of the optional data interface, which could be
annoying. Again, this feels like something we’re only doing for user-friendliness, but
I’ll consider it further.

“A model is just a container for hyper-parameters.”
This seems like a misnomer to me.

The use of the word “model” in LearnAPI.jl for the hyperparameters struct coincides with
its use in MLJ and its documentation. Objections to this use have been raised a few
times. I’d be happy to change it here; it’s probably too late for MLJ. I’d prefer a name
for the hyperparameters struct that is not pluralized, unlike “hyperparameters” or
“options”.

How about “strategy”?

I think Flux definition of a Model fits well the bill here.

@jeremiedb I disagree. The conflation of hyperparameters (learning strategy) and learned
parameters (weights) in a Flux model, while elegant, is not universally satisfactory, as I
think the existence of Lux.jl establishes.

I’d also stress the importance for the API to allow for performance-first options where one needs them. For example, not forcing the computation of feature_importances if not needed.

Good point. Generally in MLJ models, a hyperparameter is introduced to control whether some non-essential computation is carried out, if that computation is likely to incur a performance penalty. If the user opts out, then the accessor function (e.g., feature_importances) could return nothing. How does that sound, @jeremiedb? Perhaps you have a different suggestion.

Perhaps more critical is for iterative models such as GBT to be able to efficiently track metrics on an eval dataset. Under the current MLJ design, such metrics must be computed from scratch each time the evaluation is desired, which can significantly hurt performance, since inference from the full N trees must be computed each time instead of just the new trees since the last eval. I’m not clear how best to support such tracking; perhaps optional deval / x_eval / y_eval kwargs to fit could do it?

Here @jeremiedb is referring to the kind of external control of iterative models
implemented by MLJ’s
IteratedModel,
using out-of-sample estimates of model performance for early stopping, for example.

This interesting use case sounds specific to ensemble models, but I think we can handle it
using the proposed API if we add one accessor function. First, we regard the evaluation
data as “metadata” (because it is not itself going to be sub-sampled, so is not “data” in
the LearnAPI.jl sense) and so it is specified in fit, as suggested, using keyword
arguments. This provides an interface point for the evaluation data. But the external
controller also needs access to the internally computed predictions on the evaluation set,
which we provide by adding the (optional) LearnAPI accessor function
out_of_sample_predictions(model, state, report). We arrange for fit to record the
individual atomic model predictions in state and our new accessor function returns the
complete ensemble prediction (or nothing if evaluation data has not been provided). In
the event out_of_sample_predictions is not implemented (is not flagged in the
LearnAPI.functions trait) or it returns nothing, the external controller computes the
out-of-sample predictions externally “from scratch”.
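
A toy sketch of the flow of information (signatures simplified; this is not the proposed API, only an illustration of how fit, state, and the new accessor would interact):

# Toy sketch: fit records per-round out-of-sample predictions in `state`;
# the accessor returns the latest complete prediction, or `nothing`.
struct ToyEnsembleOptions
    nrounds::Int
end

function fit(options::ToyEnsembleOptions, X, y; X_eval=nothing, y_eval=nothing)
    fitted_params = (mean_y = sum(y) / length(y),)     # stand-in for an ensemble
    # (y_eval would be used to compute metrics; ignored in this toy)
    eval_predictions = X_eval === nothing ? nothing :
        [fill(fitted_params.mean_y, size(X_eval, 1)) for _ in 1:options.nrounds]
    state  = (eval_predictions = eval_predictions,)
    report = nothing
    return fitted_params, state, report
end

out_of_sample_predictions(options::ToyEnsembleOptions, state, report) =
    state.eval_predictions === nothing ? nothing : last(state.eval_predictions)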

How does that sound, @jeremiedb?

Maybe related to the above, but I’m not sure I grasp the difference in scope of update! and ingest! mentioned in LearnAPI. I’m wondering if a single fit! could be both sufficient and lighten the API verbs.

  • update! is for hyperparameter-only changes (adding iterations; training data has not
    changed)

  • ingest! is training using new data without re-initializing learned parameters

Conceptually these strike me as different, so separate verbs are appropriate, no?

One thing that I really dislike about the current state of affairs is the inconsistency
with output types. Models which are probabilistic may output distributions, categories,
integers, etc. and that is a pain to post-process. Moreover, developers of models are
lost with so much flexibility and end up choosing whatever they feel is more natural. As
a result, end-users struggle to write generic scripts that accept any kind of model.

@juliohm I agree, but I’m suggesting that the responsibility for nailing down the allowed
representation should lie with a higher-level interface (such as MLJ,
which indeed tries to do this). For example, such an interface could require that if the
predict_proxy_type is LearnAPI.Distribution() then the output of predict must
support the pdf method from Distributions.

Part of the problem is that agreement about “best representations” of data is still a bit
fluid in Julia. So I’m reluctant to lock this in at this low level. What is provided are
traits to articulate what the model output actually looks like, either in terms of
scientific types, ordinary types, or individual observation types/scitypes.

What do others think about this?

I think we should try to reduce the noise with output types as much as possible, and
make sure that API functions like predict have a well-known (and fixed) output
scientific type. We can then add extra API functions like predictprob for models that
support variations of the output.

I’m not sure I properly understand this part. Are you suggesting that:

  1. Every model that computes a proxy for the target (such as a probability
    distribution, confidence interval, survival probability, etc) should be required to also
    compute actual target values; and

  2. predict should be reserved for actual target predictions and not the target proxy?

The problem I have here is that computing actual target values may require secondary
computations, and new input from the user. For example, in probabilistic programming, it
is common to return a “sampleable” object representing a probabilistic target, in lieu of a
concrete target prediction. To get a point value requires sampling the object. Also, we
need to decide, in any kind of probabilistic predictor, whether we want the mode, median,
or mean; or maybe we should apply a probability threshold, to be learned using evaluation
data; how many random samples do we take from our sampleable object? And so forth.
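
For instance, with Distributions.jl a single probabilistic prediction already admits several defensible point reductions:

using Distributions

d = Normal(0.3, 1.2)           # a probabilistic prediction for one observation
mean(d), median(d), mode(d)    # three different, equally valid "point predictions"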

One limitation of the current proposal, which may be related to semantic concerns you have (I’m guessing here), is that a
model can only predict one kind of target proxy. And this seems reasonable, since most
models have a single proxy type as the object of their computation; everything else is
generally post-processing. However, if it would create less cognitive dissonance, we could
make the target proxy type an argument of predict, with a model having the option to
implement more than one:

LearnAPI.predict(model::MyModel, fitted_params, Xnew, ::LearnAPI.TrueTarget) -> actual target
LearnAPI.predict(model::MyModel, fitted_params, Xnew, ::LearnAPI.Distribution) -> probabilistic prediction

Or, as in Python, we could have a plethora of dedicated operations,
predict_distribution, predict_survival_probability, etc. - one for each of the 16
different proxy types already identified
here and
growing. Would others prefer this?

Also, please avoid macros if they are still present in LearnAPI.jl. I didn’t find them
in the docs, but would like to just point this out before it is too late.

@juliohm There is indeed a convenience macro, @trait (the only exported name) which provides
a shorthand for declaring traits. There’s an example
here;
the code is
here.
It seems innocuous enough to me. What do others think?

How wide is the scope of this API? Is it primarily statistical models or would it also
include other types like symbolic regression and so forth?

@DoktorMike, you can get a rough idea of the intended scope from this
list
(will
likely be extended). I think symbolic regression would be fine.

In the search for something lighter and simpler than
MLJ I’ve found Invenia’s Models.jl quite useful, which nicely separates untrained
models (templates) from trained models. Some of the ideas there may be helpful in this
endeavour?

Models.jl is nice but it appears to require that models subtype a Models.jl abstract type
and we are trying to avoid that. Also, it provides only a single “operation” predict,
while I have found it useful and natural to have transform and inverse_transform
methods as well. This is one of the features of sk-learn that I like.
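
As a toy illustration of why those extra verbs earn their keep (a sketch only; these are not LearnAPI method signatures):

# Sketch: a standardizer is most naturally expressed with fit / transform /
# inverse_transform rather than predict.
using Statistics

struct StandardizerOptions end

struct Standardizer
    mu::Float64
    sigma::Float64
end

fit(::StandardizerOptions, x::AbstractVector) = Standardizer(mean(x), std(x))

transform(s::Standardizer, x) = (x .- s.mu) ./ s.sigma
inverse_transform(s::Standardizer, z) = z .* s.sigma .+ s.mu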

Also, a project that I’m involved in has models being written to disk/database after
training, then read into memory in several separate processes (in parallel) to be used
(predicted) in long-running simulations. That is, each model is trained in 1 process and
subsequently used in several parallel processes. Ideally the serialization format would
use something not specific to Julia, such as JSON or something similar. You’ve probably
got this use case covered, but worth mentioning anyway.

@jocklawrie Mmmm. My feeling is that responsibility for serialization should live at a
higher level. What is missing, but planned, is a model-specific method to convert the
fitted_params to a “serializable” form, by which I mean a form that is persistent
(not, for example, a C pointer) and anonymized. For most models fitted_params
is already serializable, but this is not universally the case. And then there would be
a method to restore a deserialized object to a form needed by predict, etc.
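
Roughly what I have in mind (a sketch; serializable and restore are placeholder names for the planned methods, and the non-persistent handle is only simulated here):

# Sketch: strip the non-persistent parts of fitted_params before writing to
# disk, and rebuild whatever `predict` needs after reading it back.
struct TreeEnsembleFittedParams
    handle::IOBuffer             # stand-in for a non-serializable resource (e.g. a C pointer)
    coefs::Vector{Float64}
end

struct TreeEnsembleSerializable
    coefs::Vector{Float64}       # plain, persistent, anonymized data only
end

serializable(fp::TreeEnsembleFittedParams) = TreeEnsembleSerializable(copy(fp.coefs))
restore(fp::TreeEnsembleSerializable) = TreeEnsembleFittedParams(IOBuffer(), copy(fp.coefs))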

It [the LearnAPI.jl documentation] also starts with a statement that I disagree with: “Machine
learning algorithms, also called models , have a complicated taxonomy. Grouping models,
or modeling tasks, into a relatively small number of types, such as “classifier” and
“clusterer”, and attempting to impose uniform behavior within each group, is
challenging.” Machine learning algorithms are not the same concept as machine learning
models. A learning algorithm is used to learn parameters of a learning model
(e.g. maximum likelihood estimation can be used to learn coefficients of a linear model).

I’m happy to stand corrected on the distinction between models and algorithms. But
otherwise I stand by the opening statement. This is the central point really.

These models fit in well-known categories with well-defined behavior

@juliohm This may be so, but the number of such categories is very large. For example,
not all clusterers are the same. Some generalize to new data (and will implement fit)
but some don’t; most compute ordinary labels (predict_proxy will have the value
LabelAmbiguous()) but some predict “soft” (probabilistic) labels (predict_proxy will
be LabelAmbiguousDistribution()). It may ultimately be useful to define Clusterer as
a LearnAPI model with behavior varying within such-and-such bounds (articulated via
LearnAPI traits) but I don’t think this should happen in LearnAPI itself.

in the case of supervised models, are there other required API functions other than fit and predict?

Yes, since a supervised model has the concept of a target variable, and predict is
outputting the target or target proxy, you should make a predict_proxy trait declaration
(see here and here) and a position_of_target declaration (see here).

But that’s about it. You must also declare which methods you have explicitly overloaded (the functions trait). Optional traits include promises about the scitype of training data
(target_fit_scitype trait) or whether per-observation weights are supported
(position_of_weights trait).

4 Likes

I’m not too bothered by pluralization. “Strategy” is ok, but I think “options” is the most accurate plain-English description of what the struct is. Not to mention it’s a lot shorter than “hyperparameters”.

This seems like the cleanest and most Julian way to handle the varieties of prediction output types.

2 Likes

I’m not too bothered by pluralization. “Strategy” is ok, but I think “options” is the most accurate plain-English description of what the struct is. Not to mention it’s a lot shorter than “hyperparameters”.

In addition to code, I’m thinking about how the name works in documentation. My feeling is that “options” is too vague absent context. You might as well say “fields” or “properties”. So I’d want to say “model options” or something like that. But that’s two words…

This seems like the cleanest and most Julian way to handle the varieties of prediction output types.

Okay, I think we can make that work.

I am confused by the statement that “hyperparameters” could be named “strategy”. A neural network hyperparameter could be the number of layers, which has nothing to do with strategy, right?

I don’t have a problem with considering the choice of number of layers as part of the learning “strategy”. Strategy just means “plan”. So, “I plan to build a model by training a neural network on the data using six alternating convolution / maxpool layers and such-and-such activation functions”. That’s my plan=strategy. Do you see a problematic conflict with another use of “strategy” in ML? Perhaps you have a different suggestion?

Side note. I just checked Lux.jl to see what they call an abstract (uninitialised) neural network (an example of what we are seeking to name). They call it a “model”. Ughh!

1 Like

Opened an issue: Accounting for data objects that only iterate.

1 Like

Opened an issue: Allow multiple target proxy support with predict dispatching on the proxy.

1 Like

Opened an issue: Don’t use “model” to describe hyperparameter struct.

1 Like

I’ve taken some time to look at this after having forgotten about it. First of all, thanks to @ablaom for taking the time to write such a detailed proposal. I believe that the Julia ML ecosystem is still young and small enough that if this is done right, LearnAPI.jl can become as ubiquitous as Tables.jl, which I consider a resounding success.

For the most part, this is very similar to the existing MLJ interface. In my opinion this interface has worked extremely well for model construction, training and testing, which is gratifying, especially considering that it has to wrap so many inequivalent things which often have their own APIs that are badly at odds with each other.

My main concern here is the aspect of the existing MLJ API which I do not think has worked so well, and has been mentioned somewhat in the comments above: the API surrounding model inputs and outputs. Indeed, because of the aforementioned inconsistency in models, this is quite hard to deal with, and it seems that MLJ took the approach of being as agnostic as possible, which was certainly a reasonable first attempt.

It seems that in LearnAPI this is mainly addressed via getobs and reformat. Some comments around these:

  • It is imperative that LearnAPI uses the getobs which is compatible with Tables.jl and MLUtils.jl.
  • Inputs (and perhaps outputs) should have at least one required method, perhaps getobs if they are not required to be a specific interface type like AbstractArray. Otherwise it seems very hard to discover what kind of interface they should have. Different models need to co-exist with each other, so it seems insufficient for each to have its own, completely independent, interface rules. (Note that MLUtils.getobs has reasonable fall-backs for both arrays and tables.)
  • One of the biggest obstacles I have encountered when using MLJ is that for some types of inputs, while a standard tabular interface could work fine in the abstract, there are important performance considerations which make this impractical. The best examples are when, say, m of the n features come from a matrix which should be kept intact for efficiency. As a rather contrived example, suppose I have an input of N data points that consists of an m × N matrix A together with a set of n - m categorical features in a table, and that the first step (perhaps a PCA or other dimensional reduction) is the multiplication MA. We really, really do not want to break up A prior to the multiplication by, for example, allocating a new array for each of the N data points. Note that the most obvious implementation of getobs would return some kind of object where elements of the columns of A are mixed up with the other categorical features. This is definitely not what we want. I don’t know what the solution to this is. Maybe you see this as something that reformat should do… but if that’s the case I think we need (at the very least) clear documentation of exactly what that would look like. The possible solutions that occur to me mostly involve having an optional getfeaturesubset (or something with a less terrible name) method, but what that would have to look like, particularly when you don’t know what the input looks like, I don’t know.
  • Reformat potentially puts a significant burden on the implementer. The fewer constraints (required methods) on the inputs, the larger this burden.
  • It might be nice if a default fall-back of reformat dispatched to a reformattable and a reformatmatrix (or something along those lines), with a default implementation that’s something like Tables.istable(X) ? reformattable(X) : reformatmatrix(X), as sketched below.
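
Something along these lines, say (reformattable and reformatmatrix are the hypothetical names from the last bullet, not existing functions):

using Tables

reformatmatrix(X::AbstractMatrix) = X               # already in model-ready form
reformattable(X) = Tables.matrix(X)                 # e.g. a table becomes a matrix

reformat(X) = Tables.istable(X) ? reformattable(X) : reformatmatrix(X)
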
4 Likes