Replacing missing and NaN values in dataframe

janilin · March 29, 2022, 4:13pm

Hi! I was trying trying to change all missing and NaN values into 0.

Firstly, created a dataframe

using DataFrames
df_i = DataFrame( id =[101, 102, 103, 104, 105],
    name = ["A", "B", "C", NaN, "E"],
    age = [28, 32, missing, NaN, 31],
    salary = [3200, 3200, 4500, missing, missing]
)

Then tried a loop for replacing missing and NaN values, which was:

col = names(df_i);

for i in 1:length(col)
    replace!(df_i.col[i], missing => 0)
    replace!(df_i.col[i], NaN => 0)
end

Well, the arguement did not go well, hence error was inevitable.

In this context, I actually need 2 things to know:

i) Is there any way to replace missing, NaN, and any specific value, whole across the dataframe? If so, how?
ii) In my looping, what went wrong?

nilshg · March 29, 2022, 4:18pm

Your indexing does not make sense: what is df_i.col supposed to mean?

julia> df_i.col
ERROR: ArgumentError: column name :col not found in the data frame

To replace all missings use coalesce:

julia> coalesce.(df_i, 0.0)
5×4 DataFrame
 Row │ id     name  age      salary
     │ Int64  Any   Float64  Float64
─────┼───────────────────────────────
   1 │   101  A        28.0   3200.0
   2 │   102  B        32.0   3200.0
   3 │   103  C         0.0   4500.0
   4 │   104  NaN     NaN        0.0
   5 │   105  E        31.0      0.0

if you want to loop over columns, just do so directly:

julia> for c ∈ eachcol(df_i)
           replace!(c, NaN => 0.0)
       end

I’d also recommend going through https://github.com/bkamins/Julia-DataFrames-Tutorial to get the hang of DataFrames

bkamins · March 29, 2022, 4:22pm

If you want to replace both in one shot you can do:

julia> fun(x) = ismissing(x) || (x isa Number && isnan(x)) ? 0 : x
fun (generic function with 1 method)

julia> fun.(df_i)
5×4 DataFrame
 Row │ id     name  age      salary
     │ Int64  Any   Float64  Int64
─────┼──────────────────────────────
   1 │   101  A        28.0    3200
   2 │   102  B        32.0    3200
   3 │   103  C         0.0    4500
   4 │   104  0         0.0       0
   5 │   105  E        31.0       0

julien_goo · March 29, 2022, 4:31pm

I bumped into the same issue yesterday in a context of:

DataFrames
with columns of multiples types (dates, numeric, string, categoricals)
with ‘missing’ in multiple columns
NaN in Float columns (in addtion to missing)

I really struggled to find a robust method with the different types and mixes of NaN and missings.
Would it be possible to have a function like complete_cases or add an option to complete_cases to treat it?
It would be super useful for preprocessing, prior ML operations.

janilin · March 29, 2022, 4:42pm

In context of df_i.col, since I created ,

col = names(df_i)

I thought it might work (kind of a litte experiment ), as col[i] gives the column names, however, it evidently doesn’t.

Henrique_Becker · March 29, 2022, 5:01pm

Ah, you wanted the compiler to replace col[i] with a name before resolving the fact it is connected to the df_i by a dot. No, you cannot do this with this syntax, to programmatically access a field with its name value (instead the name as a literal spelled out in the source code) you need to use getproperty as in getproperty(df_i, col[i]). Basically, in Julia, any code obj.field_name is transformed before compilation into getproperty(obj, :field_name) (note the field name is a Symbol).

bkamins · March 29, 2022, 5:09pm

This seems more of a job of a specialized package for data cleaning as there are many possible ways how you might want to fill missing/NaN. However, I propose you open an issue in DataFrames.jl with your requirements and then we can think what makes sense to live in DataFrames.jl and what should go to a separate package.

In general to replace missings use coalesce as @nilshg proposed. To replace NaNs you need to check if the column is numeric and if the value is NaN. OP wanted both so I combined them in one function but I guess this is a rare use case.

Topic		Replies	Views
Replacing missing, really Data	8	10773	February 26, 2020
Replace all NaN's with zeros in DataFrame General Usage dataframes	31	9919	April 18, 2023
DataFrame and Missings.replace() Data	10	4254	November 12, 2020
Iterate over all numeric columns in DataFrames Data	21	4855	February 11, 2018
Replace missing with 0.0 in dataframe General Usage question	10	4892	December 9, 2019

Replacing *missing* and *NaN* values in dataframe

Related topics

Replacing missing and NaN values in dataframe