I am running a MLJ pipeline on a quite large dataset and I want to be able to save the model once, and then use it in later sessions without having to retrain it. I have tried following the documentation regarding saving models and that works fine. However, I am quite puzzled at the fact that the size of the saved model changes depending on the size of the dataset.
A MWE below:
using DataFrames
using MLJ
function get_data(N_rows)
df = DataFrame(rand(N_rows, 3), :auto)
df.x4 = rand(["A", "B", "C"], N_rows)
df.y = rand([0, 1], N_rows)
df = coerce(df, :y => Binary, :x4 => Multiclass)
X = select(df, Not(:y))
y = df.y
return X, y
end
N_rows = 1000
X, y = get_data(N_rows);
LogReg = @load LogisticClassifier pkg = MLJLinearModels
pipe_logreg = @pipeline(OneHotEncoder, LogReg(penalty = :none, fit_intercept = false),)
mach = machine(pipe_logreg, X, y)
fit!(mach)
MLJ.save("pipeline_test_$N_rows.jlso", mach)
The size of the saved model is 227 KB when N_rows = 1_000 and 1.7 MB when N_rows = 10_000, yet they both save the same number of parameters (3 numerical coefficients and 3 coefficients for the categories). Or, that’s at least what I expect, but why does the size really change so drastically with more data? I do not expect MLJ to save the input data as well as that would be quite unintuitive.
This is run with Julia v1.7.0, MLJ v0.16.11, MLJBase v0.18.26, and MLJLinearModels v0.5.7.
It looks like it does save the data. If you do dump(mach) or propertynames(mach) you can see that data and resampled_data are both part of mach. I am not a user of MLJ so I do not know if there is a way to tell MLJ.save to empty those fields, but perhaps there is or they might be willing to accept a PR to do something like that?
An MLJ machine caches training data by default to prioritise speed over memory. Now this cached data is not part of the learned parameters and is not saved when serialising. (By the way, you can turn the caching off at construction with machine(model, data...; cache=false)).
However, if your model is a composite model - by which a I mean a model that is implemented by “exporting” a learning network, then the learned parameters of that model may include training data, because the learned parameters is essentially the whole learning network, which includes machines that may include cached data. Roughly speaking, we can say that training data is being cached at internal nodes of the network (although not at the source nodes which are always emptied after each fit!). This is not good for serialization, and there is work underway to fix this.
However, for piplelines you don’t have to wait for a fix, if you use the newer (non-macro) version of pipelines (docs) where there is an option to turn caching off for all machines in the pipeline. In your example above, do
@pipeline is now deprecated and the docs should be up-to-date, as of the MLJ 0.17 release (about 29th December). If you find documentation suggesting the use of @pipeline please let me know and I will fix.
I see now that it was because I started my project in December that I was using the old docs. The new docs seem to work fine and are updated!
However, and this is only a minor problem, since it’s not my main model, but it doesn’t seem to be working with the LinearBinaryClassifier from MLJGLMInterface:
In this case, the saved machine take up 0.5 MB of space when N_rows = 1_000 and 5MB when N_rows = 10_000. Am I missing something here or is this a bug in MLJGLMInterface?
Actually, machines cache more than data. They also cache views of the data, which are encoded as vectors of integers representing which rows (observations) were used in the last training event. I expect this is the explanation.
I will make a note to clear this extra information in the new serialization PR.