Replacing *missing* and *NaN* values in dataframe

Hi! I was trying trying to change all missing and NaN values into 0.

Firstly, created a dataframe

using DataFrames
df_i = DataFrame( id =[101, 102, 103, 104, 105],
    name = ["A", "B", "C", NaN, "E"],
    age = [28, 32, missing, NaN, 31],
    salary = [3200, 3200, 4500, missing, missing]
)

Then tried a loop for replacing missing and NaN values, which was:

col = names(df_i);

for i in 1:length(col)
    replace!(df_i.col[i], missing => 0)
    replace!(df_i.col[i], NaN => 0)
end

Well, the arguement did not go well, hence error was inevitable.

In this context, I actually need 2 things to know:

i) Is there any way to replace missing, NaN, and any specific value, whole across the dataframe? If so, how?
ii) In my looping, what went wrong?

Your indexing does not make sense: what is df_i.col supposed to mean?

julia> df_i.col
ERROR: ArgumentError: column name :col not found in the data frame

To replace all missings use coalesce:

julia> coalesce.(df_i, 0.0)
5Γ—4 DataFrame
 Row β”‚ id     name  age      salary
     β”‚ Int64  Any   Float64  Float64
─────┼───────────────────────────────
   1 β”‚   101  A        28.0   3200.0
   2 β”‚   102  B        32.0   3200.0
   3 β”‚   103  C         0.0   4500.0
   4 β”‚   104  NaN     NaN        0.0
   5 β”‚   105  E        31.0      0.0

if you want to loop over columns, just do so directly:

julia> for c ∈ eachcol(df_i)
           replace!(c, NaN => 0.0)
       end

I’d also recommend going through https://github.com/bkamins/Julia-DataFrames-Tutorial to get the hang of DataFrames

2 Likes

If you want to replace both in one shot you can do:

julia> fun(x) = ismissing(x) || (x isa Number && isnan(x)) ? 0 : x
fun (generic function with 1 method)

julia> fun.(df_i)
5Γ—4 DataFrame
 Row β”‚ id     name  age      salary
     β”‚ Int64  Any   Float64  Int64
─────┼──────────────────────────────
   1 β”‚   101  A        28.0    3200
   2 β”‚   102  B        32.0    3200
   3 β”‚   103  C         0.0    4500
   4 β”‚   104  0         0.0       0
   5 β”‚   105  E        31.0       0
2 Likes

I bumped into the same issue yesterday in a context of:

  • DataFrames
  • with columns of multiples types (dates, numeric, string, categoricals)
  • with β€˜missing’ in multiple columns
  • NaN in Float columns (in addtion to missing)

I really struggled to find a robust method with the different types and mixes of NaN and missings.
Would it be possible to have a function like complete_cases or add an option to complete_cases to treat it?
It would be super useful for preprocessing, prior ML operations.

In context of df_i.col, since I created ,

col = names(df_i)

I thought it might work (kind of a litte experiment :sweat_smile:), as col[i] gives the column names, however, it evidently doesn’t.

Ah, you wanted the compiler to replace col[i] with a name before resolving the fact it is connected to the df_i by a dot. No, you cannot do this with this syntax, to programmatically access a field with its name value (instead the name as a literal spelled out in the source code) you need to use getproperty as in getproperty(df_i, col[i]). Basically, in Julia, any code obj.field_name is transformed before compilation into getproperty(obj, :field_name) (note the field name is a Symbol).

This seems more of a job of a specialized package for data cleaning as there are many possible ways how you might want to fill missing/NaN. However, I propose you open an issue in DataFrames.jl with your requirements and then we can think what makes sense to live in DataFrames.jl and what should go to a separate package.

In general to replace missings use coalesce as @nilshg proposed. To replace NaNs you need to check if the column is numeric and if the value is NaN. OP wanted both so I combined them in one function but I guess this is a rare use case.

3 Likes