Replacing *missing* and *NaN* values in dataframe

Hi! I was trying trying to change all missing and NaN values into 0.

Firstly, created a dataframe

using DataFrames
df_i = DataFrame( id =[101, 102, 103, 104, 105],
    name = ["A", "B", "C", NaN, "E"],
    age = [28, 32, missing, NaN, 31],
    salary = [3200, 3200, 4500, missing, missing]
)

Then tried a loop for replacing missing and NaN values, which was:

col = names(df_i);

for i in 1:length(col)
    replace!(df_i.col[i], missing => 0)
    replace!(df_i.col[i], NaN => 0)
end

Well, the arguement did not go well, hence error was inevitable.

In this context, I actually need 2 things to know:

i) Is there any way to replace missing, NaN, and any specific value, whole across the dataframe? If so, how?
ii) In my looping, what went wrong?

1 Like

Your indexing does not make sense: what is df_i.col supposed to mean?

julia> df_i.col
ERROR: ArgumentError: column name :col not found in the data frame

To replace all missings use coalesce:

julia> coalesce.(df_i, 0.0)
5Γ—4 DataFrame
 Row β”‚ id     name  age      salary
     β”‚ Int64  Any   Float64  Float64
─────┼───────────────────────────────
   1 β”‚   101  A        28.0   3200.0
   2 β”‚   102  B        32.0   3200.0
   3 β”‚   103  C         0.0   4500.0
   4 β”‚   104  NaN     NaN        0.0
   5 β”‚   105  E        31.0      0.0

if you want to loop over columns, just do so directly:

julia> for c ∈ eachcol(df_i)
           replace!(c, NaN => 0.0)
       end

I’d also recommend going through https://github.com/bkamins/Julia-DataFrames-Tutorial to get the hang of DataFrames

2 Likes

If you want to replace both in one shot you can do:

julia> fun(x) = ismissing(x) || (x isa Number && isnan(x)) ? 0 : x
fun (generic function with 1 method)

julia> fun.(df_i)
5Γ—4 DataFrame
 Row β”‚ id     name  age      salary
     β”‚ Int64  Any   Float64  Int64
─────┼──────────────────────────────
   1 β”‚   101  A        28.0    3200
   2 β”‚   102  B        32.0    3200
   3 β”‚   103  C         0.0    4500
   4 β”‚   104  0         0.0       0
   5 β”‚   105  E        31.0       0
3 Likes

I bumped into the same issue yesterday in a context of:

  • DataFrames
  • with columns of multiples types (dates, numeric, string, categoricals)
  • with β€˜missing’ in multiple columns
  • NaN in Float columns (in addtion to missing)

I really struggled to find a robust method with the different types and mixes of NaN and missings.
Would it be possible to have a function like complete_cases or add an option to complete_cases to treat it?
It would be super useful for preprocessing, prior ML operations.

In context of df_i.col, since I created ,

col = names(df_i)

I thought it might work (kind of a litte experiment :sweat_smile:), as col[i] gives the column names, however, it evidently doesn’t.

Ah, you wanted the compiler to replace col[i] with a name before resolving the fact it is connected to the df_i by a dot. No, you cannot do this with this syntax, to programmatically access a field with its name value (instead the name as a literal spelled out in the source code) you need to use getproperty as in getproperty(df_i, col[i]). Basically, in Julia, any code obj.field_name is transformed before compilation into getproperty(obj, :field_name) (note the field name is a Symbol).

This seems more of a job of a specialized package for data cleaning as there are many possible ways how you might want to fill missing/NaN. However, I propose you open an issue in DataFrames.jl with your requirements and then we can think what makes sense to live in DataFrames.jl and what should go to a separate package.

In general to replace missings use coalesce as @nilshg proposed. To replace NaNs you need to check if the column is numeric and if the value is NaN. OP wanted both so I combined them in one function but I guess this is a rare use case.

3 Likes