Replacing missing, really

harven · December 23, 2017, 11:23pm

How can I replace the missing values in a dataframe with, say, 0?

There is a Missings.replace function but it does not actually make the replacement. It builds a lazy data structure Missings.EachReplaceMissing{DataFrames.DataFrame,Int64}... that most functions working with dataframes (e.g. plots) can’t handle at the moment. Collecting such a structure does not produce a dataframe but gives a MethodError.

harven · December 24, 2017, 1:08pm

To answer my own question, I went with

  for col in names(df)
    df[ismissing.(df[col]), col] = 0
  end

but I guess that there may be a simpler way.

Tamas_Papp · December 24, 2017, 1:30pm

Missings.replace expects an iterable, but a DataFrame isn’t one.

I would do something like you did above. Note that DataFrames may not have a homogeneous column type, so replacing missing values in all columns with the same value may not be a common operation.

Alternatively,

using DataFrames
import Missings
df = DataFrame(a = [1, missing, 2], b = [missing, 3, 4])
df2 = map(c -> collect(Missings.replace(c, 0)), eachcol(df))

also works.

nalimilan · December 24, 2017, 2:19pm

@Tamas_Papp’s solution has the advantage that the returned DataFrame contains columns which do not allow for missing values, which will be faster. By contrast, replacing missing values with 0 in place won’t change the type of the columns, though you could use disallowmissing to do that manually.

An alternative approach is to use coalesce (which is included in Base in Julia 0.7, but only in Missings in Julia 0.6):

for col in names(df)
   df[col] = Missings.coalesce.(df[col], 0)
end
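
To make the column-type point concrete, here is a minimal sketch on a plain vector rather than a data frame. It assumes Julia 1.x, where coalesce lives in Base, and the variable names are made up for the illustration.

```julia
# A column that allows missing values (hypothetical example data).
col = [1, missing, 2]

# Broadcasting coalesce builds a new vector; its element type is
# computed from the results, so Missing is no longer part of it.
filled = coalesce.(col, 0)

# Overwriting the missing entries in place keeps the original
# element type, even though no missing value is left.
inplace = copy(col)
inplace[ismissing.(inplace)] .= 0

println(eltype(col))
println(eltype(filled))
println(eltype(inplace))
```

The first and third vectors should still report a type that allows missing, while the coalesced one should not, which is the difference described above.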

Do you know of simpler ways of doing this in other software?

harven · December 24, 2017, 5:36pm

Thanks for the answer. As far as I remember, this is done in R with

df[is.na(df)] <- 0

And now I realize that a similar syntax can be used in julia with the help of the eachcol iterator.

  for (_, col) in eachcol(df)
    col[ismissing.(col)] = val
  end

pdeffebach · December 24, 2017, 5:57pm

In Stata you can use a function called mvencode

mvencode variable, mv(0)

But I hate it when people use it. If I’m reading a script I would rather see

replace variable = 0 if missing(variable)

because it mirrors other replacements.

Maybe there are performance benefits to using mvencode but I’ve never noticed anything. With that in mind, @harven’s answer above is nice because it keeps the same syntax as other replacements.

nalimilan · December 25, 2017, 10:46pm

OK. The difference we have compared to R is that we support arrays which do not accept missing values, so there are more possible solutions than in R depending on the use case. Also I don’t think we want to allow df[ismissing.(df)] = 0, because data frames are not matrices. Anyway it’s a terribly inefficient approach since it forces you to allocate a Boolean mask the size of the data frame.

mvencode appears to work variable by variable IIUC, so that’s more or less similar to what we have.

Juan · January 17, 2019, 2:10am

Can’t we just do this?

replace(df, missing=>0)
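
Whether replace accepts a whole DataFrame is not confirmed anywhere in this thread. What Base's replace does handle is a single array, so a hedged, column-wise sketch (assuming Julia 1.x, with a NamedTuple of vectors standing in for a data frame and hypothetical names) could look like this:

```julia
# Columns stored as a NamedTuple of vectors, standing in for a data frame.
cols = (a = [1, missing, 2], b = [missing, 3, 4])

# replace works on each vector separately; mapping over a NamedTuple
# keeps the column names and returns new vectors.
filled = map(c -> replace(c, missing => 0), cols)

println(filled.a)
println(filled.b)
```

The original vectors are left untouched; use replace! on each column instead if an in-place update is wanted.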

kmundnic · February 26, 2020, 11:20pm

Note that it’s now possible to do:

coalesce.(df, 0)
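
A short sketch of how this might be used, assuming a recent DataFrames.jl release (1.x) on Julia 1.x. The data is the example from earlier in the thread, and df2 and df3 are names chosen for the illustration.

```julia
using DataFrames

df = DataFrame(a = [1, missing, 2], b = [missing, 3, 4])

# Broadcasting returns a new data frame; df itself is left untouched.
df2 = coalesce.(df, 0)

# In-place alternative: replace! mutates each column vector, so the
# columns of df3 keep an element type that still allows missing.
df3 = copy(df)
foreach(col -> replace!(col, missing => 0), eachcol(df3))
```

Note that using the same replacement for every column only makes sense when all columns hold compatible types, as pointed out earlier in the thread.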
