What does MLJ.save really save?

I am running an MLJ pipeline on a fairly large dataset and I want to be able to save the model once and then use it in later sessions without having to retrain it. I have tried following the documentation on saving models, and that works fine. However, I am quite puzzled by the fact that the size of the saved model changes depending on the size of the dataset.

An MWE below:

using DataFrames
using MLJ

function get_data(N_rows)
    df = DataFrame(rand(N_rows, 3), :auto)
    df.x4 = rand(["A", "B", "C"], N_rows)
    df.y = rand([0, 1], N_rows)
    df = coerce(df, :y => Binary, :x4 => Multiclass)
    X = select(df, Not(:y))
    y = df.y
    return X, y
end

N_rows = 1000
X, y = get_data(N_rows);

LogReg = @load LogisticClassifier pkg = MLJLinearModels
pipe_logreg = @pipeline(OneHotEncoder, LogReg(penalty = :none, fit_intercept = false))
mach = machine(pipe_logreg, X, y)
fit!(mach)
MLJ.save("pipeline_test_$N_rows.jlso", mach)

The size of the saved model is 227 KB when N_rows = 1_000 and 1.7 MB when N_rows = 10_000, yet both should contain the same number of parameters (3 numerical coefficients and 3 coefficients for the categories). At least, that is what I expect. Why does the size change so drastically with more data? I would not expect MLJ to save the input data as well, as that would be quite unintuitive.
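
For reference, the two sizes can be compared directly with a minimal sketch like the following, reusing get_data and pipe_logreg from above (filesize is from Base):

for N_rows in (1_000, 10_000)
    X, y = get_data(N_rows)
    mach = machine(pipe_logreg, X, y)
    fit!(mach)
    filename = "pipeline_test_$N_rows.jlso"
    MLJ.save(filename, mach)
    println("N_rows = $N_rows: ", filesize(filename), " bytes")  # grows with N_rows
end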

This is run with Julia v1.7.0, MLJ v0.16.11, MLJBase v0.18.26, and MLJLinearModels v0.5.7.

Cheers!

It looks like it does save the data. If you do dump(mach) or propertynames(mach), you can see that data and resampled_data are both part of mach. I am not an MLJ user, so I do not know whether there is a way to tell MLJ.save to empty those fields, but perhaps there is, or the maintainers might be willing to accept a PR that does something like that.
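
For instance (data and resampled_data are MLJBase internals, so the exact field names could change between versions):

# list the machine's fields; data and resampled_data appear among them
propertynames(mach)

# estimate how much memory the cached training data occupies
Base.summarysize(mach.data)
Base.summarysize(mach.resampled_data)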

Also, you need using MLJLinearModels in your MWE.

I’ll try to open an issue on their GitHub to see if I can find someone who’s deeper into the mechanics. Thanks for your answer!

@wc4wc4wc4 Thanks for your query.

An MLJ machine caches training data by default to prioritise speed over memory. This cached data is not part of the learned parameters and is not saved when serialising. (By the way, you can turn the caching off at construction with machine(model, data...; cache=false).)
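
For example, a minimal sketch using the OneHotEncoder from the MWE above:

# this machine keeps no copy of the training data after fitting
enc_mach = machine(OneHotEncoder(), X; cache=false)
fit!(enc_mach)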

However, if your model is a composite model - by which I mean a model implemented by “exporting” a learning network - then the learned parameters of that model may include training data, because the learned parameters are essentially the whole learning network, which includes machines that may contain cached data. Roughly speaking, training data is being cached at internal nodes of the network (although not at the source nodes, which are always emptied after each fit!). This is not good for serialization, and there is work underway to fix this.
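
A rough way to see this with the pipeline above, assuming the machine stores the learning network in its fitresult field (an internal detail):

mach2 = machine(pipe_logreg, X, y; cache=false)  # caching off for the outer machine only
fit!(mach2)

# the learned parameters are the whole learning network, whose internal
# machines still cache training data, so this still grows with N_rows:
Base.summarysize(mach2.fitresult)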

However, for pipelines you don’t have to wait for a fix if you use the newer (non-macro) version of pipelines (docs), where there is an option to turn caching off for all machines in the pipeline. In your example above, do:

pipe_logreg = Pipeline(OneHotEncoder,
                       LogReg(penalty = :none, fit_intercept = false),
                       cache=false)
mach = machine(pipe_logreg, X, y)
fit!(mach)
MLJ.save("pipeline_test_$N_rows.jlso", mach)
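
Repeating the filesize check from before should now give a size that is roughly independent of N_rows:

filesize("pipeline_test_$N_rows.jlso")  # no longer grows with the dataset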

Hope that helps.


Thank you so much, @ablaom! That seems like the solution to my problem.

Is the macro version of pipeline an old deprecated form? I was pretty sure I based it on the official MLJ documentation, but I might be remembering wrongly.