Save and Load Random Forest trained with MLJ/ScikitLearn.jl

I would like to train a Random Forest model (RandomForestRegressor pkg=ScikitLearn) and save the trained model to disk. Then, I would like to load the model and apply it to new data.

After training successfully, I tried to save the model using JLD2. When I load the model back and apply it again, I get this error:

ERROR: LoadError: ArgumentError: ref of NULL PyObject

Here is my code:

using MLJ
using ScikitLearn
using MLJScikitLearnInterface
using PyCall
using JLD2

x = rand(Float32, 100, 10) # 100 training samples, 10 predictors
y = x[:,2] + x[:,4]

@MLJ.load RandomForestRegressor pkg=ScikitLearn
clf = RandomForestRegressor()

# Train the model
mach = machine(clf, x, y)
MLJ.fit!(mach, verbosity=2)
yTR_hat = MLJ.predict(mach, x)

# Save the model
@JLD2.save "model.jld2" mach

# Load the saved model and apply to new data
@JLD2.load "model.jld2" mach

# This gives me the ERROR: ArgumentError: ref of NULL PyObject
yTE_hat = MLJ.predict(mach, x)

Please, how should I do this?

Thank you very much for the help

This works using BetaML:

using BetaML.Trees
using JLD2
x = rand(Float32, 100, 10) # 100 training samples, 10 predictors
y = x[:,2] + x[:,4]
myForest = buildForest(x,y,100)
yhat = Trees.predict(myForest, x)
# Save the model
@JLD2.save "model.jld2" myForest
# Load the saved model and apply to new data
@JLD2.load "model.jld2" myForest
yhat2 = Trees.predict(myForest, x)

Thank you very much for the suggestion @sylvaticus! Your package looks nice, and I will definitely keep it in mind. In my case, I am using other MLJ functionality as well (e.g. hyperparameter optimization of the Random Forest using cross-validation, not shown in the toy example above for the sake of simplicity). Everything works except the saving/loading. Therefore, I would love to see a way to save/load a model created with MLJ and the ScikitLearn packages, as in my toy example, if possible. Thank you again.

UPDATE: I found that using BSON instead of JLD2 saves the model (after also adding using MLJBase and using ScientificTypes):

using MLJ
using ScikitLearn
using MLJScikitLearnInterface
using PyCall
using BSON
using MLJBase
using ScientificTypes

x = rand(Float32, 100, 10) # 100 training samples, 10 predictors
y = x[:,2] + x[:,4]

@MLJ.load RandomForestRegressor pkg=ScikitLearn
clf = RandomForestRegressor()

# Train the model
mach = machine(clf, x, y)
MLJ.fit!(mach, verbosity=2)
yTR_hat = MLJ.predict(mach, x)

# Save the model
@BSON.save "model.bson" mach

# Load the saved model
@BSON.load "model.bson" mach

# and apply to new data. 
yTE_hat = MLJ.predict(mach, x)

The above code works fine.

But once I close the Julia session, start a new one, and load the saved model to apply it to new data:


using MLJ
using ScikitLearn
using MLJScikitLearnInterface
using PyCall
using BSON
using MLJBase
using ScientificTypes

x = rand(Float32, 100, 10) # 100 training samples, 10 predictors
@BSON.load "model.bson" mach
yTR_hat = MLJ.predict(mach, x)

the code crashes with:

signal (11): Segmentation fault: 11
in expression starting at none:1
PyObject_GetAttrString at /Users/rio/.julia/conda/3/lib/libpython3.8.dylib (unknown line)
_getproperty at /Users/rio/.julia/packages/PyCall/tqyST/src/PyCall.jl:300 [inlined]

I am running Julia 1.5.3 on macOS.

I found that the answer is in the "Saving machines" section of the MLJ documentation. Instead of using BSON or JLD2 to save the trained model, I should use MLJ's own MLJ.save function:

# Save the model
MLJ.save("my_machine.jlso", mach)

# Load the saved model
mach2 = machine("my_machine.jlso")

# and apply to new data. This works :)
xnew = rand(Float32, 100, 10)
ynew_hat = MLJ.predict(mach2, xnew)
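
For anyone who wants to check the round trip end to end, here is a minimal sketch that trains, saves with MLJ.save, reloads, and compares the reloaded machine's predictions with the original's. It reuses the calls from this thread. The wrapping of the matrices in MLJ.table and the file name my_machine.jlso are my own choices for illustration. My understanding, which I have not verified in detail, is that JLD2 and BSON store the Julia-side wrapper around the Python object rather than the Python object itself, which would explain the NULL PyObject error and the crash in a new session. Note also that this follows the MLJ version I used with Julia 1.5.3; I believe later MLJ releases switched to Julia's Serialization format (often with a .jls file name), so the file extension may need adjusting there.

```julia
using MLJ
using MLJScikitLearnInterface  # the model code must be loaded before restoring a machine

x = rand(Float32, 100, 10)     # 100 training samples, 10 predictors
y = x[:, 2] + x[:, 4]

@MLJ.load RandomForestRegressor pkg=ScikitLearn
clf = RandomForestRegressor()

# Train on a table view of the matrix (MLJ expects tabular input)
mach = machine(clf, MLJ.table(x), y)
MLJ.fit!(mach, verbosity=0)

# Save with MLJ's own serializer instead of JLD2/BSON
MLJ.save("my_machine.jlso", mach)

# Restore into a new machine and predict on unseen data
mach2 = machine("my_machine.jlso")
xnew = MLJ.table(rand(Float32, 100, 10))
ynew_hat = MLJ.predict(mach2, xnew)

# Compare the restored machine with the original one on the same data
println("predictions match: ", ynew_hat ≈ MLJ.predict(mach, xnew))
```

In a fresh Julia session only the two using lines and the part from machine("my_machine.jlso") onward are needed, which is exactly the case that crashed for me with BSON.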