Saving models in MLJ - only final ones without data

Hi all,

I have built ML pipelines using MLJ to predict on live data (random forest models for now, plus time-series cross-validation and a grid search over hyperparameters). It works like a charm.

However, I have a question. I have to serialize tens of individual MLJ.Machines and also load them into RAM to serve predictions fast.

To reduce the RAM footprint and speed up serialization/deserialization, how can I save and load only a machine with the final model (the trained MLJ pipeline, so as not to lose the feature engineering in the MLJ pipeline) and apply MLJ.predict to it? I would like to avoid storing the training samples or keeping the non-final models (no need to keep all the models tested during cross-validation). My serialized machines weigh around 25 MB for random forests, so I guess they contain unnecessary content.

Thanks for your advice!
Ju


Thanks for giving MLJ a spin.

If you are going to serialise a machine mach, the recommended protocol is:

  1. Define bare_mach = serializable(mach).

  2. Use your favourite serialiser on bare_mach.

Step 1 ensures that bare_mach has all traces of training data removed and that the learned parameters have a persistent representation (relevant for XGBoost models, which by default only store a C pointer).

After deserialising to obtain mach_deserialized, you need to call restore! on it before you can reliably call predict (or similar) on it.
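For concreteness, here is a minimal sketch of that protocol using Julia's built-in Serialization standard library (the file name and the new-data table Xnew are placeholders):

using MLJ, Serialization

# 1. Strip out training data and make the learned parameters persistent
bare_mach = serializable(mach)

# 2. Serialise with your favourite serialiser (here the standard library)
serialize("forest.jls", bare_mach)

# Later, e.g. in the serving process:
mach_deserialized = deserialize("forest.jls")

# Re-attach what the machine needs before making predictions
restore!(mach_deserialized)

yhat = predict(mach_deserialized, Xnew)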

There is a shorter workflow if you are happy to use Julia’s built-in serialiser. Details and examples are here.

Does this answer your question?


Hi,
Thanks for the answer. The logic is clear, but getting rid of the training data is difficult in practice (tracking data traces in nested structures is specific to the chosen model).
I already use the shorter workflow with:

# Serialization
MLJ.save("model.jls", mach)
# Deserialization
mach = machine("model.jls")

The tricky point is finding and removing the nested datasets in the machine (I found datasets in mach.fitresult.data, mach.fitresult.resampled_data, and in various forms in other fields).
An option such as MLJ.save("model.jls", mach, predict_only=true) would be super helpful for getting a minimal machine for predictions, fast to serialize and ready to serve predictions behind an API on the fly.
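In the meantime, something close to that can be sketched with a small helper built on serializable and Julia's standard serialiser (the helper names below are hypothetical, not part of MLJ):

using MLJ, Serialization

# Hypothetical helper: save only what is needed for prediction
function save_predict_only(path::AbstractString, mach)
    serialize(path, serializable(mach))
end

# Hypothetical helper: load it back and make it ready for predict
function load_predict_only(path::AbstractString)
    mach = deserialize(path)
    restore!(mach)
    return mach
end

The resulting files should be noticeably smaller than the full machines, since the training data is no longer stored.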

(tracking data traces in nested structures is specific to the chosen model).

serializable is supposed to remove all data, even from nested structures. For example, if the model is a Composite, then there will be machines associated with an underlying learning network, and any data associated with those machines should also be removed. (The implementation is indeed non-trivial.) If you are still seeing data, then please open an issue with a minimal working example.
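One quick, model-agnostic way to check whether the data has really been stripped is to compare in-memory sizes before and after calling serializable (a rough diagnostic only, assuming mach is the trained machine):

# The stripped machine should be much smaller than the original
println("full machine:     ", Base.summarysize(mach), " bytes")
println("stripped machine: ", Base.summarysize(serializable(mach)), " bytes")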