Standardize dataset with StatsBase

alequa · April 4, 2020, 7:51pm

Hi,
I am trying to standardize a test set using the statistics (mean and standard deviation) of the training set.
I am trying to do this with the ZScoreTransform from StatsBase, but I run into the problem shown below.


using Distributions
using StatsBase
# Data layout: (features, datapoints)
train = rand(Normal(1,10),(100,1000))
test  = rand(Normal(1,5),(100,100))

## Inspect the per-feature mean and std before standardization
println("before standardization")
mean(train,dims=2)
std(train,dims=2)
mean(test,dims=2)
std(test,dims=2)
## Fit the ZScoreTransform on the training set
train_std = StatsBase.fit(ZScoreTransform, train, dims=2)
StatsBase.transform!(train_std,train)

## Each feature is now standardized (mean ≈ 0, std ≈ 1)
mean(train,dims=2)
std(train,dims=2)

# Then I want to standardize the test set with the same transform.
StatsBase.transform!(train_std,test)
# But I get a dimension mismatch error, even though the mean and scale have the correct size!
size(train_std.mean)
size(train_std.scale)

Why? And how do I solve it?

ppalmes · April 4, 2020, 8:00pm

I ran the code and there was no problem. Maybe update to the latest version?
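
If it helps to check what the transform should produce, here is a minimal sketch that standardizes the test set by hand with the training statistics and compares that with the StatsBase result. It assumes a StatsBase version where `fit` accepts the `dims` keyword, as in the post above; the names `mu`, `sigma`, `test_manual`, `zt` and `test_sb` are only illustrative, and the data is generated with `randn` instead of Distributions to keep the example small.

```julia
using Statistics
using StatsBase

# Synthetic data laid out as (features, datapoints), as in the original post.
train = 1 .+ 10 .* randn(100, 1000)
test  = 1 .+ 5 .* randn(100, 100)

# Per-feature statistics computed from the training set only.
mu    = mean(train, dims=2)   # 100×1
sigma = std(train, dims=2)    # 100×1

# Manual z-score of the test set using the training statistics.
test_manual = (test .- mu) ./ sigma

# The same operation through StatsBase, using the non-mutating `transform`.
zt = StatsBase.fit(ZScoreTransform, train, dims=2)
test_sb = StatsBase.transform(zt, test)

# The two results should agree up to floating-point error.
@assert size(test_sb) == size(test)
@assert isapprox(test_sb, test_manual)
```

If the manual version works but `transform` still throws, that would point to the installed StatsBase version rather than to the data shapes.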
