Replacing missing, really


#1

How can I replace the missing value in a dataframe by, say, 0?

There is a Missings.replace function, but it does not actually perform the replacement. It builds a lazy data structure, Missings.EachReplaceMissing{DataFrames.DataFrame,Int64}..., that most functions working with dataframes (e.g. plots) can’t handle at the moment. Collecting such a structure does not produce a dataframe; it throws a MethodError.


#2

To answer my own question, I went with

  for col in names(df)
    df[ismissing.(df[col]), col] = 0
  end

but I guess that there may be a simpler way.


#3

Missings.replace expects an iterable, but a DataFrame isn’t one.

I would do something like you did above. Note that the columns of a DataFrame need not share a single element type, so replacing missing values in all columns with the same value may not be a common operation.
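To illustrate the point about heterogeneous columns, here is a minimal sketch in plain Julia (1.3 or later is assumed, for nonmissingtype). The named tuple `cols` is only a stand-in for the columns of a data frame, and `fill_value` is a hypothetical helper, not part of DataFrames or Missings:

```julia
# Hypothetical helper: pick a replacement value based on the column's element type.
fill_value(::Type{<:Number}) = 0
fill_value(::Type{<:AbstractString}) = ""

# Stand-in for data frame columns with different element types.
cols = (a = [1, missing, 2], s = ["x", missing, "z"])

# For each column, strip Missing from the element type, look up a suitable
# replacement, and broadcast coalesce to build a new column.
filled = map(cols) do col
    T = nonmissingtype(eltype(col))
    coalesce.(col, fill_value(T))
end

println(filled.a)
println(filled.s)
```

A single replacement value such as 0 would only make sense for the numeric column here, which is why a per-column choice may be needed.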

Alternatively,

using DataFrames
import Missings
df = DataFrame(a = [1, missing, 2], b = [missing, 3, 4])
df2 = map(c -> collect(Missings.replace(c, 0)), eachcol(df))

also works.


#4

@Tamas_Papp’s solution has the advantage that the returned DataFrame contains columns which do not allow for missing values, which will be faster. By contrast, replacing missing values with 0 in place won’t change the type of the columns, though you could use disallowmissing to do that manually.

An alternative approach is to use coalesce (which is included in Base in Julia 0.7, but only in Missings in Julia 0.6):

for col in names(df)
   df[col] = Missings.coalesce.(df[col], 0)
end
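A small sketch of the difference in column types described above, using plain vectors in place of data frame columns. It assumes Julia 1.0 or later, where coalesce lives in Base and in-place assignment of a scalar uses the `.=` syntax:

```julia
x = [1, missing, 2]   # element type is Union{Missing, Int}

# In-place replacement: the values change, but the element type
# of the vector still allows missing.
inplace = copy(x)
inplace[ismissing.(inplace)] .= 0

# Broadcasting coalesce builds a new vector whose element type
# no longer allows missing.
filled = coalesce.(x, 0)

println(eltype(inplace))
println(eltype(filled))
```

Both vectors hold the same values afterwards; only the element type differs, which is what disallowmissing would fix up for the in-place version.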

Do you know of simpler ways of doing this in other software?


#5

Thanks for the answer. As far as I remember, this is done in R with

df[is.na(df)] <- 0

And now I realize that a similar syntax can be used in Julia with the help of the eachcol iterator.

  for (_, col) in eachcol(df)
    col[ismissing.(col)] = val
  end


#6

In Stata you can use a command called mvencode:

mvencode variable, mv(0)

But I hate it when people use it. If I’m reading a script I would rather see

replace variable = 0 if missing(variable)

because it mirrors other replacements.

Maybe there are performance benefits to using mvencode but I’ve never noticed anything. With that in mind, @harven’s answer above is nice because it keeps the same syntax as other replacements.


#7

OK. The difference we have compared to R is that we support arrays which do not accept missing values, so there are more possible solutions than in R depending on the use case. Also I don’t think we want to allow df[ismissing.(df)] = 0, because data frames are not matrices. Anyway it’s a terribly inefficient approach since it forces you to allocate a matrix of the size of the dataframe.

mvencode appears to work variable by variable IIUC, so that’s more or less similar to what we have.
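For what it’s worth, the variable-by-variable idea can be sketched without building any mask at all, using Base’s replace! (available since Julia 0.7). The named tuple `cols` is just a stand-in for the columns of a data frame; with a recent DataFrames.jl I would expect eachcol(df) to play the same role, but that is an assumption and not something tested here:

```julia
# Stand-in for the columns of a data frame.
cols = (a = [1, missing, 2], b = [missing, 3, 4])

# Work column by column: replace! mutates each vector in place,
# so no mask the size of the whole table is allocated.
for col in cols
    replace!(col, missing => 0)
end

println(cols.a)
println(cols.b)
```

Note that the element type of each column still allows missing after this loop, since the vectors are modified in place.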