Data normalization with NaN values in StatsBase

Jack · January 2, 2023, 8:22pm

Greetings,

I have been trying to do some basic normalization with the ZScoreTransform from the StatsBase package, and found that it does not handle NaN values properly.
For example:

using Random, StatsBase

Random.seed!(1234)
input = rand(4,5)
input[2,2] = NaN
input[3,4] = NaN
dt = StatsBase.fit(ZScoreTransform, input; dims=1)
input_normalized = StatsBase.transform(dt, input)

The input is:

 0.579862     0.520355  0.789764    0.711389  0.131026
 0.411294   NaN         0.696041    0.103929  0.946453
 0.972136     0.839622  0.566704  NaN         0.574323
 0.0149088    0.967143  0.536369    0.870539  0.67765

The output is:

  0.214999  NaN   1.21235   NaN  -1.33013
 -0.209818  NaN   0.415227  NaN   1.073
  1.20359   NaN  -0.684786  NaN  -0.0236933
 -1.20877   NaN  -0.942793  NaN   0.280818
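
The all-NaN columns come from how NaN propagates through the mean and standard deviation: a single NaN in a column makes both statistics NaN, so every entry of that column is normalized to NaN. A minimal check, using only the Statistics standard library (the values are the second column of the input above):

```julia
using Statistics

# Second column of the input above, including the NaN at [2,2]
col2 = [0.520355, NaN, 0.839622, 0.967143]

m = mean(col2)   # NaN, because NaN propagates through the sum
s = std(col2)    # NaN for the same reason

# Every entry becomes NaN, including the valid ones
z = (col2 .- m) ./ s
```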

Ideally, the NaN values in the input would be ignored when computing the statistics, so that valid entries such as [3,2] still get valid output. Of course it's possible to hand-write a normalization function, but the package should offer an option for this so the method can be applied in a general way.
Thank you

Jack

Dan · January 2, 2023, 10:51pm

Maybe you want to achieve something like this:

julia> import Base.Iterators as Itr

julia> StatsBase.transform(
  ZScoreTransform(size(input, 2), 1, # <- dims = 1 (use eachcol)
    collect.(zip(mean_and_std.(Itr.map(x->Itr.filter(!isnan,x),eachcol(input)))...))...),
  input
)
4×5 Matrix{Float64}:
 -0.676287    0.937071   1.47177    -1.15101   -1.35379
  1.46976   NaN         -0.756893    0.655378  -0.157492
 -0.580636   -1.05285   -0.409823  NaN          0.77077
 -0.212838    0.115779  -0.305053    0.495633   0.740512
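
The one-liner above packs several steps together. A more explicit sketch of the same idea, using only the Statistics standard library, might look like the following. Note that `nan_zscore_columns` is a hypothetical helper name, not a StatsBase function, and the small matrix `X` is made up for illustration.

```julia
using Statistics

# Hypothetical helper (not part of StatsBase): z-score each column,
# computing the mean and std from the non-NaN entries only.
function nan_zscore_columns(X::AbstractMatrix{<:AbstractFloat})
    Z = similar(X)
    for j in axes(X, 2)
        col = view(X, :, j)
        valid = filter(!isnan, col)   # copy of the non-NaN entries
        μ = mean(valid)
        σ = std(valid)                # NaN if fewer than two valid entries
        Z[:, j] .= (col .- μ) ./ σ    # NaN inputs stay NaN in the output
    end
    return Z
end

# Made-up example data with one NaN in the second column
X = [1.0 2.0;
     2.0 NaN;
     3.0 4.0]

Z = nan_zscore_columns(X)
```

As in the output above, NaN entries stay NaN, but the other entries in the same column are normalized with statistics computed from the valid values.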

But really, those NaNs should probably be handled differently. In a clean pipeline they shouldn't appear at all, or should perhaps be replaced by missing.
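
A rough sketch of the missing-based route mentioned above, again assuming only the Statistics standard library and a made-up matrix `X`: NaN entries are converted to missing, and `skipmissing` is used when computing the per-column statistics.

```julia
using Statistics

# Made-up example data with one NaN in the second column
X = [1.0 2.0;
     2.0 NaN;
     3.0 4.0]

# Mark NaN entries as missing; the element type becomes Union{Missing, Float64}
Xm = [isnan(x) ? missing : x for x in X]

# Per-column statistics that skip the missing entries
μ = [mean(skipmissing(col)) for col in eachcol(Xm)]
σ = [std(skipmissing(col)) for col in eachcol(Xm)]

# Broadcast against row vectors; missing entries stay missing
Zm = (Xm .- μ') ./ σ'
```

One possible advantage of this route is that missing values are explicit in the element type, so downstream code cannot silently treat them as ordinary numbers.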
