How to apply replace to a whole DataFrame, like in Python?

To replace a value everywhere in a df in Python (pandas) we can do:

df.replace("?", np.nan, inplace = True)

but it seems Julia does not support replace on a whole DataFrame:

replace(df, "?" => missing)

returns:

MethodError: no method matching similar(::DataFrames.DataFrame, ::Type{Any})

After some testing, I found an alternative using a for loop:

for col in names(df)
    replace!(df[!, col], "?" => missing)
end

but I would prefer a one-liner to a for loop.

How can I get the same behavior as the Python code in Julia? What is the cleanest way to do that? Thank you.

Edit 1:
Here is a sample df:

df = DataFrame(a = [1, 2.2, "?"], b = ["b", "a" , "?"])

The data frame I'm working on contains not only String values but also Float and Int values.

Edit: This is unfortunately a false friend solution which only appears to work upon casual inspection.

You can use broadcasting to apply it to all values:

```julia
df = DataFrame(a = ["a", "b", "?"], b = ["?", "b", "?"])
3×2 DataFrame
 Row │ a       b
     │ String  String
─────┼────────────────
   1 │ a       ?
   2 │ b       b
   3 │ ?       ?

replace.(df, "?" => missing)
3×2 DataFrame
 Row │ a        b
     │ String   String
─────┼──────────────────
   1 │ a        missing
   2 │ b        b
   3 │ missing  missing
```


~~It does not seem to work with `replace!` though :frowning:~~

An in-place operation requires that your original columns accept missing:

julia> df = DataFrame(a = ["a", "b", "?"], b = ["?", "b", "?"])
3×2 DataFrame
 Row │ a       b
     │ String  String
─────┼────────────────
   1 │ a       ?
   2 │ b       b
   3 │ ?       ?

julia> allowmissing!(df) # this is needed for in-place to work
3×2 DataFrame
 Row │ a        b
     │ String?  String?
─────┼──────────────────
   1 │ a        ?
   2 │ b        b
   3 │ ?        ?

julia> @. df = ifelse(df == "?", missing, df)
3×2 DataFrame
 Row │ a        b
     │ String?  String?
─────┼──────────────────
   1 │ a        missing
   2 │ b        b
   3 │ missing  missing

Otherwise you need to replace columns:

julia> df = DataFrame(a = ["a", "b", "?"], b = ["?", "b", "?"])
3×2 DataFrame
 Row │ a       b
     │ String  String
─────┼────────────────
   1 │ a       ?
   2 │ b       b
   3 │ ?       ?

julia> df[!, :] = @. ifelse(df == "?", missing, df)
3×2 DataFrame
 Row │ a        b
     │ String?  String?
─────┼──────────────────
   1 │ a        missing
   2 │ b        b
   3 │ missing  missing

Note that it is not the same as:

julia> df = @. ifelse(df == "?", missing, df)
3×2 DataFrame
 Row │ a        b
     │ String?  String?
─────┼──────────────────
   1 │ a        missing
   2 │ b        b
   3 │ missing  missing

as this allocates a new data frame, while the above writes new columns into an existing data frame.
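
For the mixed-type sample df from the question, the same in-place approach might look like the sketch below. It assumes a DataFrames.jl version in which broadcasted assignment into a data frame (`df .= ...`, which is what `@. df = ...` expands to) works as in the transcripts above. Column a has element type Any and so already accepts missing, while column b only does after allowmissing!.

```julia
using DataFrames

# Sample data from the question: column a mixes numbers and strings (eltype Any),
# column b holds only strings (eltype String).
df = DataFrame(a = [1, 2.2, "?"], b = ["b", "a", "?"])

# Widen the column element types so that they can hold missing.
allowmissing!(df)

# Replace every cell equal to "?" in place; all other cells keep their values.
df .= ifelse.(df .== "?", missing, df)
```

Because the comparison is on whole cells, a string such as "am I ?" would be left alone; only cells that are exactly "?" become missing.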

  1. Can you explain the details behind the error MethodError: no method matching similar(::DataFrames.DataFrame, ::Type{Any})? It doesn't make sense to me.
    Does replace() use similar() under the hood?
  2. Why do you use ifelse() instead of replace() in this situation?

I don't know why Julia's behavior is so weird.
AFAIK, the data frame after the change should be

replace.(df, "?" => missing)
3×2 DataFrame
 Row │ a        b
     │ String?  String?
─────┼──────────────────
   1 │ a        missing
   2 │ b        b
   3 │ missing  missing

not

replace.(df, "?" => missing)
3×2 DataFrame
 Row │ a        b
     │ String   String
─────┼──────────────────
   1 │ a        missing
   2 │ b        b
   3 │ missing  missing

because I wrote "?" => missing, not "?" => "missing".

It doesn't make sense to me. Maybe I'm lacking knowledge about the type system in Julia, or maybe Julia is designed like that.

@bkamins explained this above: the original column is of type String, so when you try to add missing, it is converted to a String. That's why he suggested allowmissing!, which changes the column types so that they accept missing.

I have a question: why not just use @DrChainsaw's solution?
It looks simple and easy to understand.

df = DataFrame(a = ["a", "b", "?"], b = ["?", "b", "?"])
df = replace.(df, "?" => missing)

For two reasons:

  1. it allocates a new data frame, it is not in-place.
  2. it converts missing to "missing".

The point is that the replace function, when applied to a string, has the following contract:

replace(s::AbstractString, pat=>r, [pat2=>r2, ...]; [count::Integer])

so essentially what you are doing is:

julia> replace("?", "?" => missing)
"missing"

To see that this is the case see:

julia> df = DataFrame(x = ["am I ?", "? or else", "now ? in the middle"])
3×1 DataFrame
 Row │ x
     │ String
─────┼─────────────────────
   1 │ am I ?
   2 │ ? or else
   3 │ now ? in the middle

julia> replace.(df, "?" => missing)
3×1 DataFrame
 Row │ x
     │ String
─────┼───────────────────────────
   1 │ am I missing
   2 │ missing or else
   3 │ now missing in the middle

In short: in this case replace does something completely different from what you think it does.
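
The difference can be checked in plain Julia, without DataFrames. A minimal sketch: broadcasting `replace.` over a data frame of strings calls the string method on each cell, which splices the printed replacement into the text, whereas the collection method of replace swaps out whole elements equal to the old value. The names s and v are only for illustration.

```julia
# String method: the replacement value is printed into the string,
# so missing ends up as the text "missing".
s = replace("am I ?", "?" => missing)

# Collection method: elements equal to "?" are replaced by the value missing,
# and the element type of the result is widened to accept it.
v = replace(["a", "b", "?"], "?" => missing)

println(s)          # am I missing
println(eltype(v))  # Union{Missing, String}
```

The collection method is the one that matches what the question is after, and it is what the column-by-column loop in the question relies on.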
