Hi, I am trying to convert some code from Python to Julia and am wondering what the best way is to replace NaN values with the column averages. I'm trying to accomplish exactly what is in the post, but in Julia.
My first guess was to implement it this way, but it very clearly does not work.
eeg = replace!(eeg, NaN=>mean(eeg, dims=2))
MethodError: Cannot `convert` an object of type Transpose{Float64, Matrix{Float64}} to an object of type Float64
nR,nC = size(X)
missingMask = ismissing.(X)
overallStat = mean(skipmissing(X))
statf = imputer.hpar.statistic # mean by default
cStats = [sum(ismissing.(X[:,i])) == nR ? overallStat : statf(skipmissing(X[:,i])) for i in 1:nC]
X̂ = [missingMask[r,c] ? cStats[c] : X[r,c] for r in 1:nR, c in 1:nC]
Edit: sorry, I see you want to replace NaN values, not missing ones, but the logic is the same.
The problem in your code is that what you are trying to replace the NaNs with is an array (the whole vector of means), not the specific column mean of the element being replaced.
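To illustrate: `mean(eeg, dims=2)` returns a matrix, while `replace!` expects a scalar on the right-hand side of the pair, hence the `MethodError`. A minimal sketch that does pick the right per-column value (the small `eeg` matrix here is hypothetical example data; note also that a column mean computed over values including NaN would itself be NaN, so the NaNs must be filtered out first):

```julia
using Statistics

# Hypothetical example matrix; NaN marks a bad sample
eeg = [1.0 NaN 3.0;
       4.0 5.0 NaN;
       NaN 8.0 9.0]

# Mean of each column over the non-NaN entries only
colmeans = [mean(filter(!isnan, c)) for c in eachcol(eeg)]

# Replace each NaN in place with its own column's mean
for j in axes(eeg, 2)
    col = @view eeg[:, j]
    col[isnan.(col)] .= colmeans[j]
end
```

If your channels run along rows rather than columns, swap the roles of the two dimensions accordingly.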
Of course you can always use loops; they aren't slow in Julia.
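For example, a plain double loop does the same job with no allocations beyond a couple of scalars (the function name `nanmeanimpute!` is just illustrative):

```julia
# Replace NaNs with the mean of the non-NaN entries of their column, in place.
# An all-NaN column is left untouched (its mean would be NaN anyway).
function nanmeanimpute!(X::AbstractMatrix{<:AbstractFloat})
    for j in axes(X, 2)
        s, n = 0.0, 0
        for i in axes(X, 1)
            if !isnan(X[i, j])
                s += X[i, j]
                n += 1
            end
        end
        n == 0 && continue  # all-NaN column: nothing to impute with
        m = s / n
        for i in axes(X, 1)
            isnan(X[i, j]) && (X[i, j] = m)
        end
    end
    return X
end
```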
I'm not sure you want to replace with the mean (let alone zero), but since you asked, and since Julia/DataFrames.jl is competing with pandas, it should be just as easy and just as discoverable.
One option is removing rows, which is maybe the most sensible and I guess should be the default for such a function; another option is the mean; least sensible to me seems replacing with zeros, but that should also be an option, since:
Methods to replace NaN values with zeros in Pandas DataFrame:
Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates any correlations involving the variable(s) that are imputed. […]
Non-negative matrix factorization
Non-negative matrix factorization (NMF) can take missing data while minimizing its cost function, rather than treating these missing data as zeros that could introduce biases.[9] This makes it a mathematically proven method for data imputation. […]
Regression
Regression imputation has the opposite problem of mean imputation. […]
Multiple imputation
In order to deal with the problem of increased noise due to imputation, Rubin (1987)[10] developed a method for averaging the outcomes across multiple imputed data sets to account for this. […]
Just as there are multiple methods of single imputation, there are multiple methods of multiple imputation as well. One advantage that multiple imputation has over the single imputation and complete case methods is that multiple imputation is flexible and can be used in a wide variety of scenarios. Multiple imputation can be used in cases where the data are missing completely at random, missing at random, and even when the data are missing not at random.
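Coming back to the three simple options discussed above (dropping rows, zeros, or the column mean), here is a base-Julia sketch on a plain matrix with made-up data; DataFrames.jl offers the same patterns per column via `filter` and `mapcols`:

```julia
using Statistics

X = [1.0 NaN 3.0;
     4.0 5.0 NaN;
     7.0 8.0 9.0]

# Option 1: drop any row that contains a NaN
complete = X[[!any(isnan, r) for r in eachrow(X)], :]

# Option 2: replace NaN with zero (replace matches NaN via isequal)
zeroed = replace(X, NaN => 0.0)

# Option 3: replace NaN with the column mean of the observed values
means = [mean(filter(!isnan, c)) for c in eachcol(X)]
imputed = [isnan(X[i, j]) ? means[j] : X[i, j] for i in axes(X, 1), j in axes(X, 2)]
```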