Saving models in MLJ - only final ones without data

Hi all,

I have built ML pipelines using MLJ to predict on live data (random forest models for now, plus time-series cross-validation and a grid search over hyperparameters). It works like a charm.

However, I have a question. I have to serialize tens of individual MLJ.Machines and also load them into RAM to serve predictions fast.

To reduce the RAM footprint and speed up serialization/deserialization, how can I save and load only a machine with the final model (the trained MLJ pipeline, so as not to lose the feature engineering in the MLJ pipeline) and apply MLJ.predict to it? I would like to avoid storing the training samples or keeping the non-final models (no need to keep all the models tested during cross-validation). My serialized machines weigh around 25 MB for random forests, so I guess they contain unnecessary content.

Thanks for your advice!
Ju


Thanks for giving MLJ a spin.

If you are going to serialise a machine mach, the recommended protocol is:

  1. Define bare_mach = serializable(mach).

  2. Use your favourite serialiser on bare_mach.

Step 1 ensures that bare_mach has all traces of training data removed and that the learned parameters have a persistent representation (relevant for XGBoost models, which by default only store a C pointer).

After deserialising to obtain mach_deserialized, you need to call restore! on it before you can reliably call predict (or similar) on it.
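For concreteness, here is a minimal sketch of that protocol using Julia's built-in Serialization standard library (the file name and the new-data table Xnew are placeholders):

using MLJ, Serialization

# 1. Strip out training data and make the learned parameters persistent
bare_mach = serializable(mach)

# 2. Serialise with your favourite serialiser (here the standard library)
serialize("forest.jls", bare_mach)

# Later, e.g. in the serving process:
mach_deserialized = deserialize("forest.jls")

# Re-attach what the machine needs before making predictions
restore!(mach_deserialized)

yhat = predict(mach_deserialized, Xnew)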

There is a shorter workflow if you are happy to use Julia’s built-in serialiser. Details and examples are here.

Does this answer your question?


Hi,
Thanks for the answer. The logic is clear, but getting rid of the training data is difficult in practice (tracking data traces in nested structures is specific to the chosen model).
I already use the shorter workflow with:

# Serialization
MLJ.save("model.jls", mach)
# Deserialization
mach = machine("model.jls")

The tricky point is finding and removing the nested datasets in the machine (I found datasets in mach.fitresult.data, mach.fitresult.resampled_data, and in various forms in other fields).
An option such as MLJ.save("model.jls", mach, predict_only=true) would be super helpful for getting a minimal machine for predictions, fast to serialize and ready to serve predictions behind an API on the fly.
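In the meantime, something close to that can be sketched with a small helper built on serializable and Julia's standard serialiser (the helper names below are hypothetical, not part of MLJ):

using MLJ, Serialization

# Hypothetical helper: save only what is needed for prediction
function save_predict_only(path::AbstractString, mach)
    serialize(path, serializable(mach))
end

# Hypothetical helper: load it back and make it ready for predict
function load_predict_only(path::AbstractString)
    mach = deserialize(path)
    restore!(mach)
    return mach
end

The resulting files should be noticeably smaller than the full machines, since the training data is no longer stored.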

(tracking data traces in nested structures is specific to the chosen model).

serializable is supposed to remove all data, even from nested structures. For example, if the model is a Composite, then there will be machines associated with an underlying learning network, and any data associated with those machines should also be removed. (The implementation is indeed non-trivial.) If you are still seeing data, then please open an issue with a minimal working example.
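One quick, model-agnostic way to check whether the data has really been stripped is to compare in-memory sizes before and after calling serializable (a rough diagnostic only, assuming mach is the trained machine):

# The stripped machine should be much smaller than the original
println("full machine:     ", Base.summarysize(mach), " bytes")
println("stripped machine: ", Base.summarysize(serializable(mach)), " bytes")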