Hi, I am trying to create a standardization function from scratch to scale features:
Here’s my current implementation:
```julia
using Statistics

# Standardize training features and return the learned parameters
function scale_features(X)
    μ = mean(X, dims=1)
    σ = std(X, dims=1)
    X_norm = (X .- μ) ./ σ
    return (X_norm, μ, σ)
end

# Scale the testing features using the learned parameters
function transform_features(X, μ, σ)
    X_norm = (X .- μ) ./ σ
    return X_norm
end
```

Calling it on a small example:

```julia
julia> scale_features([1, -2.2, 3])
([0.15249857033260472, -1.0674899923282326, 0.9149914219956281], [0.5999999999999999], [2.6229754097208002])
```
I am getting different results from the pre-made `StandardScaler` in sklearn with Python:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([1, -2.2, 3]).reshape(-1, 1)
scaler = StandardScaler()
print(scaler.fit_transform(x))
```
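For reference, both sets of numbers can be reproduced with plain numpy; the only difference is the `ddof` argument, which controls whether the divisor in the standard deviation is `n` or `n - 1` (a minimal sketch, not part of my original code):

```python
import numpy as np

x = np.array([1, -2.2, 3])

# Corrected (sample) standard deviation: divides by n - 1
std_corrected = x.std(ddof=1)

# Uncorrected (population) standard deviation: divides by n
std_uncorrected = x.std(ddof=0)

print((x - x.mean()) / std_corrected)    # the numbers my Julia function returns
print((x - x.mean()) / std_uncorrected)  # the numbers StandardScaler returns
```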
What am I doing wrong with my implementation? Thanks in advance.
Please see point 3 in "Please read: make it easier to help you".
In other words, please also post the input data and the "pre-made functions" you call.
Can you clarify which pre-made functions you're comparing against? A full, runnable example that shows where the output doesn't match your expectations would be helpful.
One common gotcha is the orientation of your data: if you have 1000 samples of 8-dimensional data, is it stored as a 1000×8 matrix or an 8×1000 matrix?
Updated the question with an MWE.
From the docs of `std`:

> The algorithm returns an estimator of the generative distribution's standard deviation under the assumption that each entry of `itr` is an IID drawn from that generative distribution. For arrays, this computation is equivalent to calculating `sqrt(sum((itr .- mean(itr)).^2) / (length(itr) - 1))`. If `corrected` is `true`, then the sum is scaled with `n-1`, whereas the sum is scaled with `n` if `corrected` is `false`, with `n` the number of elements in `itr`.
`std` by default computes a "corrected" standard deviation (dividing by `n-1`). This is what you want if the mean of your data is estimated by taking the average of the data itself. If you know the mean a priori and subtract it off manually, or provide it with the `mean` argument, then you should give `corrected=false`.
So I think your implementation is doing the right thing and `sklearn` is not.
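The known-mean point can be illustrated with a quick numpy simulation (a sketch; the sample size and repetition count are arbitrary): with the true mean plugged in, dividing the sum of squares by `n` gives an unbiased variance estimate, while with the sample mean, `n - 1` is the unbiased divisor.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_var, n = 0.0, 1.0, 5

known_mean_est = []   # squared deviations from the KNOWN mean, divided by n
sample_mean_est = []  # squared deviations from the SAMPLE mean, divided by n - 1
for _ in range(20000):
    x = rng.normal(true_mean, np.sqrt(true_var), size=n)
    known_mean_est.append(np.sum((x - true_mean) ** 2) / n)
    sample_mean_est.append(np.sum((x - x.mean()) ** 2) / (n - 1))

print(np.mean(known_mean_est))   # close to the true variance 1.0
print(np.mean(sample_mean_est))  # also close to 1.0
```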
Standardization is just a simple procedure with two major benefits: easier interpretation and better numerical properties (of the likelihood etc.; it is a simple preconditioning).

There is no single canonical way to do it; either the uncorrected or the corrected `std` is fine, as long as it is used consistently.

There are also alternative approaches, e.g. scaling by two standard deviations, which are also fine.
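For illustration, the two-standard-deviations variant just uses `2σ` as the divisor (`scale_2sd` is a hypothetical helper name; the output is exactly half the usual standardized values, so the choice only rescales):

```python
import numpy as np

def scale_2sd(x):
    # Center, then divide by twice the corrected standard deviation
    return (x - x.mean()) / (2 * x.std(ddof=1))

x = np.array([1, -2.2, 3])
print(scale_2sd(x))
```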
That's fair; saying that `sklearn` is doing the wrong thing was overly harsh. Also, the difference matters less and less as you get more data.