How do you train a machine on several datasets?

Hello!
I’m new to the MLJ package, but I gather that a fundamental aspect is that when you create a machine object you also bind the data to it. I have several very large dataframes that I would like to use to train a model, and separate datasets that I want to transform with the fitted machine in order to analyze them (this is about using unsupervised learning for anomaly detection).

So, can you add more training data to a machine object after it has been created, and fit it to the new data as well? And is it possible to transform new data after fitting?


Thanks @hpaldan for your query and for giving MLJ a try.

To paraphrase your questions as I understand them:

  1. Does MLJ support incremental learning (updating learned parameters based on new data)?

  2. Can a machine bound to an unsupervised model, trained on data X, be used to transform new data Xnew?

Do I understand correctly?

The answer to 1. is currently no. You can add iterations to a machine bound to an iterative model (eg, EvoTreeClassifier) but not new data.
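
To illustrate what "adding iterations" looks like, here is a minimal sketch. It assumes EvoTrees.jl is installed, that the classifier is exported as EvoTreeClassifier, and that its iteration hyperparameter is called nrounds. The iris data is just a stand-in.

```julia
using MLJ

# assumption: EvoTrees.jl is installed and exports EvoTreeClassifier
EvoTreeClassifier = @iload EvoTreeClassifier pkg=EvoTrees

X, y = @load_iris  # a table and a categorical vector

model = EvoTreeClassifier(nrounds=10)
mach = machine(model, X, y) |> fit!   # trains 10 rounds

# Increase the iteration parameter and refit the same machine.
# The data bound to the machine is unchanged; only more rounds are requested.
model.nrounds = 20
fit!(mach)
```

On the second fit! call the log should report that the machine is being updated rather than trained from scratch, provided only the iteration parameter was changed.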

The answer to 2. is yes and there are many examples around in the MLJ learning resources. Here’s another:

using MLJ

PCA = @iload PCA pkg=MultivariateStats

X, y = @load_iris # a table and a vector

model = PCA(maxoutdim=2)
mach = machine(model, X) |> fit!

Xnew = (sepal_length = [6.4, 7.2, 7.4],
        sepal_width = [2.8, 3.0, 2.8],
        petal_length = [5.6, 5.8, 6.1],
        petal_width = [2.1, 1.6, 1.9],)

# training data transformed:
transform(mach, X)

# new data transformed:
transform(mach, Xnew)
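
For the "several dataframes" part of the question, one possible workaround (a sketch, not an incremental-learning feature) is to stack the training frames into one table before binding it to the machine. This assumes the frames share the same columns and that the combined table fits in memory. The names df1, df2 and dfnew are hypothetical stand-ins for your data.

```julia
using MLJ, DataFrames

PCA = @iload PCA pkg=MultivariateStats

# hypothetical stand-ins for the several training dataframes
df1 = DataFrame(a = rand(100), b = rand(100), c = rand(100))
df2 = DataFrame(a = rand(100), b = rand(100), c = rand(100))

# stack the rows; the column names must match across frames
Xtrain = vcat(df1, df2)

mach = machine(PCA(maxoutdim=2), Xtrain) |> fit!

# a separate dataset to analyze with the fitted machine
dfnew = DataFrame(a = rand(5), b = rand(5), c = rand(5))

# MLJ and DataFrames both export `transform`, so qualify the call
W = MLJ.transform(mach, dfnew)
```

Note the qualified MLJ.transform: with both packages loaded, a bare transform call is ambiguous.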

If you are transitioning from another ML platform (eg, scikit-learn or R), you may find this useful: MLJ for Data Scientists in Two Hours


Yes! Those are exactly my questions, thank you! Somehow I must have missed the examples with transforming new data; I only saw examples where partition is used. I will look through the tutorial, thanks again!