Implementing a standardization function from scratch

Hi, I am trying to write a standardization function from scratch to scale features.

Here’s my current implementation:

using Statistics

# Standardize training features and return the learned parameters
function scale_features(X)
    μ = mean(X, dims=1)   # per-column means
    σ = std(X, dims=1)    # per-column standard deviations

    X_norm = (X .- μ) ./ σ

    return (X_norm, μ, σ)
end


# Scale the testing features using the learned parameters
function transform_features(X, μ, σ)
    X_norm = (X .- μ) ./ σ
    return X_norm
end

MWE:
scale_features([1, -2.2, 3])

results:
([0.15249857033260472, -1.0674899923282326, 0.9149914219956281], [0.5999999999999999], [2.6229754097208002])
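
For completeness, the second function is then applied to held-out data with the parameters learned on the training set; the test values below are made up:

X_norm, μ, σ = scale_features([1, -2.2, 3])
transform_features([0.5, -1.0, 2.0], μ, σ)   # reuses the training μ and σ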


I am getting different results from pre-made functions in sklearn with Python:

import numpy as np 
from sklearn.preprocessing import StandardScaler


x = np.array([1, -2.2, 3]).reshape(-1, 1)   

StandardScaler().fit_transform(x)

# results:
array([[ 0.18677184],
       [-1.30740289],
       [ 1.12063105]])

What am I doing wrong with my implementation? Thanks in advance.

Please see point 3 in “Please read: make it easier to help you”.

In other words, please also post the input data, and the “pre-made functions” you call.


Can you clarify which pre-made functions you’re comparing to? A complete, runnable example that shows where the output doesn’t match your expectations would be helpful.

One common gotcha is the orientation of your data: if you have 1000 samples of 8-dimensional data, is it stored as a 1000×8 matrix or as an 8×1000 matrix?
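
As a quick sketch (made-up 3×2 data), the dims argument has to match that orientation:

using Statistics

X = [1.0 10.0; 2.0 20.0; 3.0 30.0]   # 3 samples × 2 features

mean(X, dims=1)   # 1×2 matrix of per-feature means: [2.0 20.0]
mean(X, dims=2)   # 3×1 matrix of per-sample means — not what you want here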


Updated the question with an MWE.

From the docs of std:

The algorithm returns an estimator of the generative distribution’s standard deviation under the assumption that each entry of itr is an IID drawn from that generative distribution. For arrays, this computation is equivalent to calculating sqrt(sum((itr .- mean(itr)).^2) / (length(itr) - 1)). If corrected is true, then the sum is scaled with n-1, whereas the sum is scaled with n if corrected is false with n the number of elements in itr.

std computes a “corrected” standard deviation by default, i.e. it divides by n-1. That is what you want when the mean is estimated by taking the average of the data. If you know the mean a priori and subtract it off manually, or pass it via the mean keyword argument, then you should give corrected=false to std.
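
You can verify this on your example data (last digits approximate):

using Statistics

x = [1, -2.2, 3]

std(x)                    # corrected, divides by n-1: ≈ 2.62298 (your σ)
std(x, corrected=false)   # uncorrected, divides by n: ≈ 2.14165

# This reproduces the sklearn numbers, since StandardScaler divides by n:
(x .- mean(x)) ./ std(x, corrected=false)   # ≈ [0.18677, -1.30740, 1.12063]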

So I think your implementation is doing the right thing and sklearn is not.


Standardization is just a simple procedure with two major benefits: easier interpretation and better numerical properties (of the likelihood etc.; it is a simple preconditioning).

There is no single canonical way to do it: using either the uncorrected or the corrected std is fine, as long as it is applied consistently.

There are also alternative approaches, e.g. dividing by two standard deviations, which are fine as well.
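
For instance, a two-standard-deviation version (a sketch using the corrected std; the helper name is made up) is just:

using Statistics

# Divide by 2σ so binary and continuous predictors end up on comparable
# scales (cf. Gelman's scaling of regression inputs)
scale_2sd(x) = (x .- mean(x)) ./ (2 * std(x))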


That’s fair; saying sklearn is doing the wrong thing was overly harsh. Also, the difference matters less and less as you get more data.
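
Concretely, the two estimators differ only by a factor of sqrt(n/(n-1)), which tends to 1 as n grows (a quick check; values approximate):

ratio(n) = sqrt(n / (n - 1))   # corrected σ divided by uncorrected σ

ratio(3)      # ≈ 1.2247 — noticeable on the 3-point example above
ratio(1000)   # ≈ 1.0005 — negligible once you have more data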