[ANN] MLJ: an update

tlienart · November 29, 2019, 10:07pm

A lot of work has been done on MLJ and associated repos since @ablaom’s last announcement some 7 months ago, so here’s a brief update:

MLJ now interfaces with around 100 models including most of ScikitLearn’s,
the @pipeline macro makes it easy for users to define simple sequences of operations and models,
there is now extensive documentation and a new dedicated repository for tutorials, MLJTutorials, which includes end-to-end examples and a port of the labs from An Introduction to Statistical Learning,
MLJBase now supports many useful metrics for regression and classification; it also makes it easier for package devs to work with multiple tabular data formats (via Tables.jl) and categorical features (via CategoricalArrays.jl).

For devs

MLJ can help you focus on “just” developing models and benefit from MLJ’s machinery for data pre-processing, hyper-parameter tuning, evaluation metrics, etc.

If you have or know of a Julia package that fits the fit/predict/transform paradigm, please consider adding an interface to MLJ and registering your package with MLJModels. This will allow users to discover, compare and compose (many) models.

Thanks a lot to those who’ve already been helping us out and/or have given us detailed feedback on MLJ (@samuel_okon, @nignatiadis, @juliohm, @cscherrer, @ZacLN, @jpsamaroo, and many others).

Brief roadmap

Over the coming months we will be focusing on:

polishing the user interface: we would like to stabilise the API for MLJ and MLJBase by the end of February and release a 1.0 for both then,
adding capacity for more sophisticated hyper parameter tuning,
improving MLJ’s support for distributed & multithreaded computing.

See also the suggested projects section for contributors if you’re interested in helping out (thanks!).

Have a nice weekend

robsmith11 · November 30, 2019, 7:55am

You mentioned “sophisticated hyper parameter tuning” in the road map. What methods are you considering and/or already supporting?

Anything along the lines of R’s DiceOptim for optimization with noise?

tlienart · November 30, 2019, 9:54am

Well, anything more sophisticated than grid or basic random search (both of which we support). For instance, we would like to try using/interfacing with Hyperopt.jl (which supports Bayesian optimisation (BO), fancier random search, etc). I’m not familiar with R’s DiceOptim (thanks for the link).
It would also be nice to try to leverage appropriate tools from JuliaOpt for this.
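
For readers unfamiliar with the two baseline strategies mentioned above, here is a minimal sketch in plain Julia. It does not use MLJ's tuning API: `toy_loss`, `grid_search` and `random_search` are hypothetical names made up for this illustration, and the loss is a toy function standing in for a cross-validated error.

```julia
using Random

# Hypothetical validation loss as a function of two hyper-parameters.
# In practice this would be a cross-validated error from a real model.
toy_loss(lambda, gamma) = (log10(lambda) + 2)^2 + (gamma - 0.3)^2

# Grid search: evaluate every combination of the candidate values.
function grid_search(loss, lambdas, gammas)
    best = (loss = Inf, lambda = NaN, gamma = NaN)
    for lambda in lambdas, gamma in gammas
        l = loss(lambda, gamma)
        if l < best.loss
            best = (loss = l, lambda = lambda, gamma = gamma)
        end
    end
    return best
end

# Random search: draw n candidates, with lambda log-uniform between
# 1e-4 and 1, and gamma uniform between 0 and 1.
function random_search(loss, rng::AbstractRNG, n::Integer)
    best = (loss = Inf, lambda = NaN, gamma = NaN)
    for _ in 1:n
        lambda = 10.0^(-4 * rand(rng))
        gamma = rand(rng)
        l = loss(lambda, gamma)
        if l < best.loss
            best = (loss = l, lambda = lambda, gamma = gamma)
        end
    end
    return best
end

# Both strategies get the same budget of 55 evaluations (5 x 11 grid).
grid_best = grid_search(toy_loss, [1e-4, 1e-3, 1e-2, 1e-1, 1.0], 0.0:0.1:1.0)
rand_best = random_search(toy_loss, MersenneTwister(42), 55)
```

Neither strategy uses the results of earlier evaluations to choose the next candidate, which is the main thing methods such as Bayesian optimisation add.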

Ratingulate · December 1, 2019, 3:55am

Exciting release!

Any thoughts on streaming and distributed/out of core datasets?

xiaodai · December 1, 2019, 4:04am

It would be great to have general support in MLJ.jl for such things. I have made a prototype in JLBoost.jl where the user can use JDF.JDFFile to fit gradient-boosted trees (same family as XGBoost) on larger-than-RAM datasets in an out-of-core fashion. See https://github.com/xiaodaigh/JLBoost.jl#fit-model-on-jdfjdffile---enabling-larger-than-ram-model-fit

It’s not part of MLJ.jl, though. Keen to bring such support to my packages once MLJ has a general framework for it.

tlienart · December 1, 2019, 11:46am

On the streaming side, interop with OnlineStats is being considered (maybe @ablaom can comment on this).

With respect to distributed/out-of-core, MLJ is built around compatibility with the Tables.jl interface; it works with JuliaDB for instance, and it would work with anything else that implements the Tables.jl interface.

datnamer · December 1, 2019, 2:08pm

What happens to the data after it leaves the table? Is MLJ typed for abstract arrays, or only for dense in-memory arrays?

tlienart · December 1, 2019, 10:55pm

Maybe it’s useful to stress that MLJ is effectively “just” a tool to use and compose other models; some are offered by MLJ but most are external. In the interface to a given package, MLJ expects a Table object by design, but the implementation of the fit function is then free to do whatever it needs to get the data into a form that the model can process.
To facilitate this, we offer functions that convert the data when the package requires it, for instance to matrices, with or without transposition depending on what’s needed.
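
As a rough illustration of the table-to-matrix step, here is a sketch that uses only Tables.jl rather than MLJ's own conversion helpers (so the exact function an MLJ interface would call is not shown here). The column names and values are invented for the example.

```julia
using Tables

# A NamedTuple of equal-length vectors is a valid Tables.jl column table.
# The column names and values are made up for this example.
table = (height = [1.85, 1.67, 1.50], weight = [80.0, 62.0, 55.0])

# Materialise the table as an n x p matrix (rows = observations,
# columns = features), the orientation a model's fit code might work with.
X = Tables.matrix(table)
```

Any source that implements the Tables.jl interface could be passed in place of the NamedTuple.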

To exemplify this, a number of algorithms in the Stats ecosystem expect a p x n matrix as input, whereas the convention in MLJ is n x p; the MLJ interface can provide the p x n form as an adjoint view. In the unfortunate case where the algorithm does not support adjoints, a copy is needed but, fingers crossed, MultivariateStats and the like will gradually be updated so that we don’t have to copy data around when it isn’t required.

I hope this makes sense and answers your question.
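
The difference between an adjoint view and a copy can be seen with plain Julia arrays. This is a small sketch with made-up numbers, independent of MLJ:

```julia
using LinearAlgebra

# n x p data matrix (n = 4 observations, p = 3 features), the MLJ convention.
X = [ 1.0  2.0  3.0;
      4.0  5.0  6.0;
      7.0  8.0  9.0;
     10.0 11.0 12.0]

# p x n adjoint view: a lazy wrapper around X, so no data is copied.
Xt = X'

# Writing to X is visible through the view because they share memory.
X[1, 2] = -2.0
seen_through_view = Xt[2, 1]

# If an algorithm only accepts a dense Matrix, a p x n copy has to be made.
Xdense = Matrix(Xt)
```

For real-valued data the adjoint is the same as the transpose, so `Xt` holds exactly the entries of `X` with rows and columns swapped.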
