Right way of applying `inverse_transform`

Hi, I have a setup like this:

dftrain, dftest = partition(df, 0.7, shuffle=true, rng=123)

datapipe = ContinuousEncoder() |> Standardizer()
datatrans_mach = machine(datapipe, dftrain) |> fit!
normalized_train = MLJ.transform(datatrans_mach, dftrain)
normalized_test = MLJ.transform(datatrans_mach, dftest)

normalizer = fitted_params(datatrans_mach).machines[2]

ytrain, Xtrain = normalized_train.target, select(normalized_train, Not(:target))
ytest, Xtest = normalized_test.target, select(normalized_test, Not(:target))

knn = KNNRegressor()
knnM = machine(knn, Xtrain, ytrain) |> fit!

All well and good, but when I do

predict(knn, inverse_transform(normalizer, Xtest))

I get
ERROR: Attempting to transform data with incompatible feature labels.
So, since the Standardizer was trained on dftrain, which includes the target column, I thought I should include it and tried

predict(knn, inverse_transform(normalizer, hcat(Xtest, ytest)))

But that is evidently not it, since I get a more fundamental incompatibility:

ERROR: ArgumentError: dimension of input points:44 and tree data:43 must agree

And if I first predict and then inverse-transform, i.e.

inverse_transform(normalizer, predict(knn, Xtest))

I get

ERROR: type Nothing has no field names

So, how can I get the predictions on the original scale?

What are you hoping to do?

If I understood your example, Xtrain is your normalized training dataset and Xtest is your normalized testing dataset.
So if you trained your model on Xtrain, you should be able to do predict() on Xtest like this

# get predictions for your test dataset
ytest_hat = predict(knnM, Xtest)

It is possible that your error is simply a typo: your fitted machine is called knnM, whereas your predict() call uses knn (the unfitted model).


In general, would it be possible for you to change your workflow and separate your X and y early on (à la Common MLJ Workflows)?

That way your target transformations would be kept separate and easy to debug if you run into problems. I’d argue it’s also the more common way: some transforms can leak information from the target into your features, so you tend to separate the two early on.

E.g., changing your code to:

y, X = unpack(df, ==(:target), rng=123);
(Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.7, shuffle=true, multi=true, rng=123)

datapipe = ContinuousEncoder() |> Standardizer()
datatrans_mach = machine(datapipe, Xtrain) |> fit!

normalized_train = MLJ.transform(datatrans_mach, Xtrain)
normalized_test = MLJ.transform(datatrans_mach, Xtest);

knn = KNNRegressor()
knnM = machine(knn, normalized_train, ytrain) |> fit!

# out of sample predictions that you can evaluate performance on
ytest_hat = predict(knnM, normalized_test)
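
If the reason for your inverse_transform calls was to get predictions back on the original scale of the target, one option is to standardize y with its own machine and invert only that. A minimal sketch, assuming the names from the code above and MLJ’s built-in UnivariateStandardizer (I believe the TransformedTargetModel wrapper automates the same pattern):

# standardize the target with its own machine, so it can be inverted on its own
ynorm_mach = machine(UnivariateStandardizer(), ytrain) |> fit!
ztrain = MLJ.transform(ynorm_mach, ytrain)

# train against the standardized target
knnM = machine(KNNRegressor(), normalized_train, ztrain) |> fit!

# predictions come out on the standardized scale ...
zhat = predict(knnM, normalized_test)

# ... and inverse_transform maps them back to the original scale
ytest_hat = inverse_transform(ynorm_mach, zhat)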

@ctrebbau I’m not sure I get your question, but check whether the following code does what you want. You could use whatever workflow you wish, but the MLJ workflows that @svilupp pointed out make things conceptually easier. The trick below is to have the pipeline itself drop the :target column, so the feature labels match when you pass in the full inverse-transformed table.

# This assumes that the name of your target feature is `:target`
# You can replace this with the actual name of your target feature
knnp = (X -> select(X, Not(:target))) |> KNNRegressor
knnM = machine(knnp, normalized_train, ytrain) |> fit!
predict(knnM, inverse_transform(normalizer, normalized_test))
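
Once the feature labels line up, you can sanity-check the out-of-sample predictions against the held-out target, for example with MLJ’s rms measure. A sketch, reusing the names above:

# compare out-of-sample predictions with the held-out targets
ytest_hat = predict(knnM, inverse_transform(normalizer, normalized_test))
rms(ytest_hat, ytest)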

Hi, thank you for your prompt help, and sorry for my late response; I evidently wasn’t able to explain myself clearly. I’ve adhered more closely to the standard workflow, separating target and features early on, even before normalizing, and I’m happy to report that I’m getting sensible predictions now.
