[ANN] MLJ: an update

A lot of work has been done on MLJ and associated repos since @ablaom’s last announcement some 7 months ago so here’s a brief update:

  • MLJ now interfaces with around 100 models, including most of ScikitLearn’s,
  • the @pipeline macro makes it easy for users to define simple sequences of operations and models,
  • extensive documentation and a new dedicated repository for tutorials: MLJTutorials, which includes end-to-end examples and a port of the Introduction to Statistical Learning’s labs,
  • MLJBase now supports many useful metrics for regression and classification; it also makes it easier for package devs to work with multiple tabular data formats (via Tables.jl) and categorical features (via CategoricalArrays.jl).

For devs

MLJ can help you focus on “just” developing models and benefit from MLJ’s machinery for data pre-processing, hyper-parameter tuning, evaluation metrics, etc.

If you have or know of a Julia package that fits the fit/predict/transform paradigm, please consider adding an interface to MLJ and registering your package with MLJModels. This will allow users to discover, compare and compose (many) models.
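To give a flavour of what the fit/predict pattern looks like, here is a minimal, package-free sketch. The names and signatures below are purely illustrative (the real interface hooks live in MLJBase/MLJModels and differ in detail):

```julia
# Hypothetical sketch of the fit/predict contract -- NOT the real MLJ API.
struct MeanRegressor end   # a "model" with no hyper-parameters

# `fit` learns from training data and returns everything needed to predict.
fit(::MeanRegressor, X, y) = (mean = sum(y) / length(y),)

# `predict` uses the learned state to make predictions on new data.
predict(::MeanRegressor, fitresult, Xnew) = fill(fitresult.mean, length(Xnew))

model     = MeanRegressor()
fitresult = fit(model, nothing, [1.0, 2.0, 3.0])
yhat      = predict(model, fitresult, [10.0, 20.0])   # [2.0, 2.0]
```

A package exposing its learning algorithm through hooks of this shape is straightforward to wrap for MLJ.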

Thanks a lot to those who’ve already been helping us out and/or have given us detailed feedback on MLJ (@samuel_okon, @nignatiadis, @juliohm, @cscherrer, @ZacLN, @jpsamaroo, and many others)

Brief roadmap

In future months we will be focusing on

  • polishing the user interface: we would like to stabilise the APIs of MLJ and MLJBase by the end of February and release a 1.0 of both at that point,
  • adding capacity for more sophisticated hyper-parameter tuning,
  • improving MLJ’s support for distributed & multithreaded computing.

See also the suggested projects section for contributors if you’re interested in helping out (thanks!).

Have a nice weekend


You mentioned “sophisticated hyper-parameter tuning” in the roadmap. What methods are you considering and/or already supporting?

Anything along the lines of R’s DiceOptim for optimization with noise?


Well, anything more sophisticated than grid or basic random search (which we already support). For instance, we would like to try using/interfacing with Hyperopt.jl (which supports Bayesian optimisation, fancier random search, etc.). I’m not familiar with R’s DiceOptim (thanks for the link).
It would also be nice to leverage appropriate tools from JuliaOpt for such things.


Exciting release!

Any thoughts on streaming and distributed/out-of-core datasets?

It would be great to have general support in MLJ.jl for such things. I have made a prototype in JLBoost.jl where the user can use a JDF.JDFFile to fit a gradient-boosted tree (same family as XGBoost) on larger-than-RAM datasets in an out-of-core fashion. See https://github.com/xiaodaigh/JLBoost.jl#fit-model-on-jdfjdffile---enabling-larger-than-ram-model-fit

But it’s not MLJ.jl, though. I’m keen to bring such support to my packages once MLJ has a general framework for it.


On the streaming side, interop with OnlineStats is being considered (maybe @ablaom can comment on this).

With respect to distributed/out-of-core data, MLJ is built around compatibility with the Tables.jl interface; it works with JuliaDB, for instance, and should work with anything else that implements Tables.jl.

What happens to the data after it leaves the table? Is MLJ typed for abstract arrays, or dense in memory?

Maybe it’s useful to stress that MLJ is effectively “just” a tool to use and compose other models; some are offered by MLJ, but most are external. In the interface to a given package, MLJ expects a Table object by design, but the implementation of the fit function is then free to do whatever it wants to put the data in a form that the model can process.
To facilitate this, we offer utility functions to convert the data if required by the package, for instance to matrices with or without transposition, depending on what’s needed.
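As a rough, Base-only sketch of the kind of conversion involved (MLJ’s actual helpers live in MLJBase and work on any Tables.jl table; the code below only mimics the idea with a named tuple of column vectors, the simplest table-like structure):

```julia
# A column "table" as a named tuple of vectors (Tables.jl-style, illustrative only).
table = (x1 = [1.0, 2.0, 3.0], x2 = [4.0, 5.0, 6.0])

# Convert to an n × p matrix (rows = observations), as a model's fit might require.
X = reduce(hcat, values(table))   # 3 × 2 Matrix{Float64}

size(X)   # (3, 2)
```

In practice, a model’s interface code calls such a helper once, up front, so the model itself never needs to know about tables.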

To exemplify this: a number of algorithms in the statistics ecosystem expect input as a p × n matrix by convention, whereas the convention in MLJ is n × p. The MLJ interface can provide the former as an adjoint view; in the unfortunate case where the algorithm does not support adjoints, a copy is needed. But, fingers crossed, packages like MultivariateStats will slowly be updated so that we don’t have to copy data around when it’s not required.
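In plain Julia terms (Base only, no MLJ required), the difference between an adjoint-style view and a copy looks like this:

```julia
X = [1.0 2.0; 3.0 4.0; 5.0 6.0]   # n × p, MLJ convention: 3 observations, 2 features

Xv = transpose(X)    # p × n lazy view: no data is copied
Xc = permutedims(X)  # p × n materialized copy: allocates new memory

# Both have the same values, but only the view shares memory with X.
Xv == Xc             # true
X[1, 2] = 99.0
Xv[2, 1]             # 99.0 -- the view sees the mutation; Xc still holds 2.0
```

An algorithm that accepts any `AbstractMatrix` can work with `Xv` directly and avoid the copy; one hard-typed to `Matrix` forces the `permutedims` path.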

I hope this makes sense and answers your question.