Replacing NaN values with the average column values

Hi, I am trying to convert some code from Python to Julia and am wondering what the best way is to replace NaN values with the column averages. I’m trying to accomplish exactly what is in the post, but in Julia.

My first guess was to implement it this way, but it very clearly does not work.

eeg = replace!(eeg, NaN=>mean(eeg, dims=2))

MethodError: Cannot `convert` an object of type Transpose{Float64, Matrix{Float64}} to an object of type Float64

Thank you!

This is exactly what BetaML.FeatureBasedImputer does.

Here is an extract from its fit! code:

nR,nC = size(X)
missingMask = ismissing.(X)
overallStat = mean(skipmissing(X))
statf       = imputer.hpar.statistic # mean by default
cStats      = [sum(ismissing.(X[:,i])) == nR ? overallStat : statf(skipmissing(X[:,i])) for i in 1:nC]
X̂ = [missingMask[r,c] ? cStats[c] : X[r,c] for r in 1:nR, c in 1:nC]

Edit: sorry, I see you want to replace NaN values, not missing ones, but the logic is the same.

The problem in your code is that the replacement you pass is an Array (all the means at once), not the specific column mean of the element being replaced.
Of course you can always use loops, they aren’t slow in Julia.
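
As a minimal sketch of the loop approach for NaN rather than missing, assuming the data is a plain floating-point matrix (the function name fill_nan_with_colmean! is made up for illustration and is not part of BetaML):

```julia
using Statistics

# Hypothetical helper: replace each NaN with the mean of the non-NaN
# entries in the same column. All-NaN columns are left unchanged.
function fill_nan_with_colmean!(X::AbstractMatrix{<:AbstractFloat})
    for c in axes(X, 2)
        col = view(X, :, c)
        all(isnan, col) && continue            # nothing to average over
        m = mean(x for x in col if !isnan(x))  # column mean, skipping NaN
        for r in axes(X, 1)
            if isnan(X[r, c])
                X[r, c] = m
            end
        end
    end
    return X
end

X = [1.0 NaN; NaN 4.0; 3.0 6.0]
fill_nan_with_colmean!(X)
```

Here the first column is filled with 2.0 and the second with 5.0. Note that mean(eeg, dims=2) in the original attempt averages along rows; column means correspond to dims=1.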


Another method would be to define a helper function:

function impute_mean!(v)
    m = mean(Iterators.filter(!isnan, v))
    v[isnan.(v)] .= m
    v
end

and then suppose we have a DataFrame:

julia> df = DataFrame(a = Float64[1,2,3,NaN, NaN, 4,5])
7×1 DataFrame
 Row │ a       
     │ Float64 
─────┼─────────
   1 │     1.0
   2 │     2.0
   3 │     3.0
   4 │   NaN
   5 │   NaN
   6 │     4.0
   7 │     5.0

we can:

julia> foreach(x->eltype(x)<:AbstractFloat && impute_mean!(x), eachcol(df))

julia> df
7×1 DataFrame
 Row │ a       
     │ Float64 
─────┼─────────
   1 │     1.0
   2 │     2.0
   3 │     3.0
   4 │     3.0
   5 │     3.0
   6 │     4.0
   7 │     5.0

which will fill all the columns. Using the function will be convenient for specific columns as well.
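
The same helper should also work on a plain matrix like the one in the original question, since eachcol and view return views that write back into the parent array. A small sketch, with the helper repeated so the snippet runs on its own and with made-up data:

```julia
using Statistics

# Helper from the post above, repeated to keep the snippet self-contained.
function impute_mean!(v)
    m = mean(Iterators.filter(!isnan, v))
    v[isnan.(v)] .= m
    v
end

A = [1.0 10.0; NaN 20.0; 3.0 NaN]

# Only one specific column: pass a view of it.
impute_mean!(view(A, :, 2))
only_second = copy(A)      # the first column still contains its NaN here

# Every column, in place.
foreach(impute_mean!, eachcol(A))
```

For a DataFrame, a single column can be handled the same way with impute_mean!(df.a), because df.a refers to the stored column vector rather than a copy.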


Thank you for this solution, it works for me!

I’m not sure you want to replace with the mean (let alone zero), but since you asked, and since Julia/DataFrames.jl competes with pandas, it should be just as easy and discoverable.

One option is removing rows, which is maybe the most sensible and, I guess, should be the default for some function. Another option could be the mean. Replacing with zeros seems the least sensible to me, but it should also be an option since:

Methods to replace NaN values with zeros in Pandas DataFrame:

  • fillna()
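
For the row-removal option mentioned above, a minimal sketch on a plain matrix (made-up data, no packages needed) could look like this:

```julia
# Keep only the rows that contain no NaN at all.
X = [1.0 2.0; NaN 4.0; 5.0 6.0; 7.0 NaN]

keep = [!any(isnan, row) for row in eachrow(X)]  # one Bool per row
X_complete = X[keep, :]                          # logical indexing on rows
```

This returns a new matrix and leaves X untouched, so the dropped rows can still be inspected afterwards.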

In pandas, 0 seems to be the default, which seems wrong to me; the default should be something else, with 0 or another value allowed only through a keyword option. I would also look into the following (and its Julia equivalent):
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate
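
I don't know of a one-to-one Julia equivalent offhand, so here is a hand-written sketch of plain linear interpolation over NaN gaps, assuming equally spaced samples. The name interpolate_nan! is hypothetical, and this is not a port of the pandas function:

```julia
# Hypothetical helper: fill interior runs of NaN by linear interpolation
# between the nearest valid neighbours. Leading and trailing NaNs have a
# neighbour on one side only and are left untouched.
function interpolate_nan!(v::AbstractVector{<:AbstractFloat})
    valid = findall(!isnan, v)                    # indices of known values
    for (lo, hi) in zip(valid[1:end-1], valid[2:end])
        hi - lo > 1 || continue                   # no gap between these two
        step = (v[hi] - v[lo]) / (hi - lo)
        for i in lo+1:hi-1
            v[i] = v[lo] + step * (i - lo)
        end
    end
    return v
end

v = [1.0, 2.0, 3.0, NaN, NaN, 4.0, 5.0]
interpolate_nan!(v)
```

With the example column from earlier in the thread, the two gaps are filled with values one third and two thirds of the way from 3.0 to 4.0, instead of the column mean.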

From the imputation Wikipedia article:

Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates any correlations involving the variable(s) that are imputed. […]
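
A small check of that claim, using the example column from earlier in the thread and a made-up, fully observed companion column y:

```julia
using Statistics

x = [1.0, 2.0, 3.0, NaN, NaN, 4.0, 5.0]   # column with gaps
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # made-up companion column

seen = .!isnan.(x)                         # mask of observed entries
x_imputed = copy(x)
x_imputed[.!seen] .= mean(x[seen])         # mean imputation

# The sample mean is the same before and after imputation.
mean_before, mean_after = mean(x[seen]), mean(x_imputed)

# The correlation with y on complete cases vs. after imputation.
cor_before = cor(x[seen], y[seen])
cor_after  = cor(x_imputed, y)
```

On this toy data the mean stays at 3.0 while cor_after comes out smaller than cor_before, which is the attenuation the article describes. It is a single illustration, not a proof.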

Non-negative matrix factorization

Non-negative matrix factorization (NMF) can take missing data while minimizing its cost function, rather than treating these missing data as zeros that could introduce biases.[9] This makes it a mathematically proven method for data imputation. […]

Regression

Regression imputation has the opposite problem of mean imputation. […]

Multiple imputation

In order to deal with the problem of increased noise due to imputation, Rubin (1987)[10] developed a method for averaging the outcomes across multiple imputed data sets to account for this. […]

Just as there are multiple methods of single imputation, there are multiple methods of multiple imputation as well. One advantage that multiple imputation has over the single imputation and complete case methods is that multiple imputation is flexible and can be used in a wide variety of scenarios. Multiple imputation can be used in cases where the data are missing completely at random, missing at random, and even when the data are missing not at random.
