What does MLJ.save really save?

I am running an MLJ pipeline on a fairly large dataset and I want to be able to save the model once and then use it in later sessions without having to retrain it. I have tried following the documentation on saving models, and that works fine. However, I am quite puzzled by the fact that the size of the saved model changes depending on the size of the dataset.

An MWE is below:

using DataFrames
using MLJ

function get_data(N_rows)
    df = DataFrame(rand(N_rows, 3), :auto)
    df.x4 = rand(["A", "B", "C"], N_rows)
    df.y = rand([0, 1], N_rows)
    df = coerce(df, :y => Binary, :x4 => Multiclass)
    X = select(df, Not(:y))
    y = df.y
    return X, y
end

N_rows = 1000
X, y = get_data(N_rows);

LogReg = @load LogisticClassifier pkg = MLJLinearModels
pipe_logreg = @pipeline(OneHotEncoder, LogReg(penalty = :none, fit_intercept = false),)
mach = machine(pipe_logreg, X, y)
fit!(mach)
MLJ.save("pipeline_test_$N_rows.jlso", mach)

The size of the saved model is 227 KB when N_rows = 1_000 and 1.7 MB when N_rows = 10_000, yet both save the same number of parameters (3 numerical coefficients and 3 coefficients for the categories). At least, that is what I expect. So why does the size change so drastically with more data? I would not expect MLJ to save the input data as well, as that would be quite unintuitive.

This is run with Julia v1.7.0, MLJ v0.16.11, MLJBase v0.18.26, and MLJLinearModels v0.5.7.

Cheers!

It looks like it does save the data. If you run dump(mach) or propertynames(mach), you can see that data and resampled_data are both fields of mach. I am not an MLJ user, so I do not know whether there is a way to tell MLJ.save to empty those fields, but perhaps there is, or the maintainers might be willing to accept a PR that does something like that?
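
For example, a rough, version-dependent way to see which fields dominate the machine's in-memory size (field names such as data and resampled_data come from the MLJBase release quoted above and may differ in other versions):

for name in propertynames(mach)
    sz = Base.summarysize(getproperty(mach, name))        # bytes held by this field
    println(rpad(String(name), 22), round(sz / 1024, digits = 1), " KB")
end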

Also, you need using MLJLinearModels in your MWE.

I'll try opening an issue on their GitHub to see if I can find someone who knows the mechanics more deeply. Thanks for your answer!

@wc4wc4wc4 Thanks for your query.

An MLJ machine caches training data by default to prioritise speed over memory. This cached data is not part of the learned parameters and is not saved when serialising. (By the way, you can turn the caching off at construction with machine(model, data...; cache=false).)
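
For illustration, a minimal sketch reusing X, y, and LogReg from the MWE above; Xcont here is just a name I introduce to drop the Multiclass column so the atomic classifier sees only Continuous features, and the exact numbers depend on the MLJ version:

Xcont = select(X, Not(:x4))                              # LogisticClassifier expects Continuous features
mach_cached  = machine(LogReg(), Xcont, y)               # default: cache=true
mach_nocache = machine(LogReg(), Xcont, y; cache=false)  # no internal data caching
fit!(mach_cached); fit!(mach_nocache)
# Rough estimate of the extra memory the cached machine holds (its cached,
# reformatted training data):
Base.summarysize(mach_cached) - Base.summarysize(mach_nocache)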

However, if your model is a composite model, by which I mean a model implemented by “exporting” a learning network, then the learned parameters of that model may include training data, because the learned parameters are essentially the whole learning network, which includes machines that may hold cached data. Roughly speaking, training data is being cached at the internal nodes of the network (although not at the source nodes, which are always emptied after each fit!). This is not good for serialization, and there is work underway to fix it.
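
One rough way to see this for the macro pipeline in the MWE, assuming the learned parameters live in the machine's internal fitresult field (true for the MLJBase version quoted above; internals may change in later releases):

# For a composite model the fitresult is essentially the whole learning network,
# so its size grows with N_rows, unlike the handful of coefficients you might expect.
Base.summarysize(mach.fitresult)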

However, for pipelines you don't have to wait for a fix if you use the newer (non-macro) version of pipelines (docs), which has an option to turn caching off for all machines in the pipeline. In your example above, do

pipe_logreg = Pipeline(OneHotEncoder,
                       LogReg(penalty = :none, fit_intercept = false),
                       cache=false)
mach = machine(pipe_logreg, X, y)
fit!(mach)
MLJ.save("pipeline_test_$N_rows.jlso", mach)

Hope that helps.

Thank you so much, @ablaom! That seems like the solution to my problem.

Is the macro version of pipeline an old, deprecated form? I was pretty sure I based my code on the official MLJ documentation, but I might be remembering wrongly.

@wc4wc4wc4 You're welcome.

@pipeline is now deprecated, and the docs should be up to date as of the MLJ 0.17 release (about 29th December). If you find documentation suggesting the use of @pipeline, please let me know and I will fix it.

I see now that I was using the old docs because I started my project in December. The new docs work fine and are up to date!

However (and this is only a minor problem, since it's not my main model), it does not seem to work with the LinearBinaryClassifier from MLJGLMInterface:

using MLJGLMInterface

pipe_GLM = Pipeline(
    OneHotEncoder(drop_last = true),
    x -> table(Matrix(x)),
    LinearBinaryClassifier(fit_intercept = true),
    cache = false,
)

mach = machine(pipe_GLM, X, y)
fit!(mach)
MLJ.save("pipeline_test_GLM_$N_rows.jlso", mach)

In this case, the saved machine takes up 0.5 MB of space when N_rows = 1_000 and 5 MB when N_rows = 10_000. Am I missing something here, or is this a bug in MLJGLMInterface?

Actually, machines cache more than data. They also cache views of the data, which are encoded as vectors of integers representing which rows (observations) were used in the last training event. I expect this is the explanation.
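
As a rough, MLJ-agnostic illustration, a row view stored as a vector of integer indices costs about 8 bytes per observation on a 64-bit system:

rows = collect(1:10_000)   # one Int index per training observation
sizeof(rows)               # 80_000 bytes, i.e. roughly 80 KB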

I will make a note to clear this extra information in the new serialization PR.

Is it possible to get around this? It means the saved GLM pipeline machines take up several gigabytes on my real data.

That's way too big. Okay, a little investigation shows there is something else going on that is specific to GLM. See this issue.

Great, I’ll follow the Github issue then!

@wc4wc4wc4 Thanks to @samuel_okon, the recently released MLJGLMInterface 0.3.0 should now resolve your GLM memory issue.

Thanks a lot, @ablaom!