I have built ML pipelines using MLJ to predict on live data (random forest models for now, plus time-series cross-validation and grid search over hyperparameters). It works like a charm.
However, I have a question. I have to serialize tens of individual MLJ.Machines, and also load them into RAM to serve predictions fast.
To reduce the RAM footprint and speed up serialization/deserialization, how can I save and load only a machine holding the final model (the trained MLJ pipeline, so as not to lose the feature engineering in the MLJ.Pipeline) to apply `MLJ.predict`? I don't want to store data from the training samples or keep the non-final models (no need to keep all the models tested during CV). My serialized machines weigh around 25 MB for random forests, so I suspect they contain unnecessary content.
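For context, the training setup looks roughly like this (a simplified sketch: the dataset, model choice, and hyperparameter range are illustrative stand-ins):

```julia
using MLJ

# illustrative model choice; loaded from MLJDecisionTreeInterface
Forest = @load RandomForestRegressor pkg=DecisionTree

X, y = make_regression(200, 5)  # stand-in for the real live-data features/target

# feature engineering + model combined in a single pipeline
pipe = Standardizer() |> Forest()

# grid search over one illustrative hyperparameter, with time-series CV
r = range(pipe, :(random_forest_regressor.n_trees), values=[50, 100, 200])
tuned = TunedModel(model=pipe, tuning=Grid(), resampling=TimeSeriesCV(nfolds=4), range=r)

mach = machine(tuned, X, y)
fit!(mach)
```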
Thanks for your advice!
Thanks for giving MLJ a spin.
If you are going to serialise a machine `mach`, the recommended protocol is:

1. `bare_mach = serializable(mach)`
2. Use your favourite serialiser on `bare_mach`.

Step 1 ensures that `bare_mach` has all traces of training data removed, and that the learned parameters have a persistent representation (relevant for the XGBoost models, which by default only store a C pointer).
After deserialising to obtain `mach_deserialized`, you need to call `restore!` on it before you can reliably call `predict` or whatever on it.
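In code, assuming `mach` is a trained machine and `Xnew` is a table of new input data, the full round trip looks like this:

```julia
using MLJ, Serialization

# 1. strip training data and give learned parameters a persistent form
bare_mach = serializable(mach)

# 2. any serialiser works; here, Julia's standard library one
serialize("model.jls", bare_mach)

# ... later, e.g. in the serving process:
mach_deserialized = deserialize("model.jls")
restore!(mach_deserialized)

predict(mach_deserialized, Xnew)
```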
There is a shorter workflow if you are happy to use Julia’s built-in serialiser. Details and examples are here.
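That shorter workflow looks like this:

```julia
MLJ.save("model.jls", mach)   # applies serializable internally
mach2 = machine("model.jls")  # deserialises and calls restore! for you
predict(mach2, Xnew)
```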
Does this answer your question?
Thanks for the answer. The logic is clear, but getting rid of the training data is difficult in practice (tracking traces of data in nested structures, in ways specific to the chosen model).
I already use the shorter workflow with:
```julia
mach = machine("model.jls")
```
The tricky point is to find and remove the nested datasets in the machine (I found some datasets in `mach.fitresult.resampled_data`, and others in various forms in other fields).
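To locate the heavy fields, I probe the machine with something like this (a rough diagnostic of my own, not MLJ API; the field layout it walks over is an internal detail):

```julia
# print any field (recursively, up to `depth`) whose in-memory size exceeds `threshold`
function heavy_fields(x, prefix=""; threshold=10^6, depth=4)
    depth == 0 && return
    for name in fieldnames(typeof(x))
        isdefined(x, name) || continue
        f = getfield(x, name)
        sz = Base.summarysize(f)
        sz > threshold && println(prefix, name, " => ", sz, " bytes")
        isstructtype(typeof(f)) && heavy_fields(f, string(prefix, name, "."); threshold, depth=depth - 1)
    end
end

heavy_fields(mach)
```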
An option such as `MLJ.save("model.jls", mach, predict_only=true)` would be super helpful for getting a minimal machine for predictions, fast to serialize and to serve predictions behind an API on the fly.
> tracking traces of data in nested structures, in ways specific to the chosen model
`serializable` is supposed to remove all data, even from nested structures. For example, if the model is a `Composite`, then there will be machines associated with an underlying learning network, and any data associated with those machines should also be removed. (The implementation is indeed non-trivial.) If you are still seeing data, then please open an issue with a minimal working example.
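One quick way to check that `serializable` has done its job, without relying on internal field names, is to compare in-memory sizes before and after:

```julia
bare_mach = serializable(mach)
Base.summarysize(mach), Base.summarysize(bare_mach)
# the second number should be dramatically smaller once training data is stripped
```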