Replace missing values based on column data type

Hello,

Question 1:
I have df with multiple columns say a,b,c,d… so on and i would like to replace all missing values dynamically in each columns with 0(if column data type is int) and with “xyz” if column data type is string.

Question 2:

I have column a in dataframe df and it could consists of numerical values where they should have been strings and i tried to convert them using string.(df[!, :a]) and it worked but issue that i have identified was if column has missing values then they have been converted to string “missing” then I used passmissing to ignore missing values and it worked just fine. I would like to check if there is any efficient way to solve this issue.

For question 1, you may want below result:

julia> df = DataFrame(a = [-1, missing,missing,3], b = [4,5,missing, 7.2], c = ["k", missing, "m", "asd"])   
4×3 DataFrame
 Row │ a        b          c       
     │ Int64?   Float64?   String? 
─────┼─────────────────────────────
   1 │      -1        4.0  k
   2 │ missing        5.0  missing 
   3 │ missing  missing    m
   4 │       3        7.2  asd

julia> for col ∈ eachcol(df)
           if String <: eltype(col)
               col[ismissing.(col)] .= "xyz"
           else
               col[ismissing.(col)] .= 0
           end
       end

julia> df
4×3 DataFrame
 Row │ a       b         c       
     │ Int64?  Float64?  String?
─────┼───────────────────────────
   1 │     -1       4.0  k
   2 │      0       5.0  xyz
   3 │      0       0.0  m
   4 │      3       7.2  asd

Code:

using DataFrames

df = DataFrame(a = [-1, missing,missing,3], b = [4,5,missing, 7.2], c = ["k", missing, "m", "asd"])

for col ∈ eachcol(df)
    if String <: eltype(col)
        col[ismissing.(col)] .= "xyz"
    else
        col[ismissing.(col)] .= 0
    end
end
df

For question 2, well, I don’t know it’s efficient or short, but ifelse function solves almost cases.

julia> df = DataFrame(a = [-1, missing,missing,3], b = [4,5,missing, 7.2], c = ["k", missing, "m", "asd"])   
4×3 DataFrame
 Row │ a        b          c       
     │ Int64?   Float64?   String?
─────┼─────────────────────────────
   1 │      -1        4.0  k
   2 │ missing        5.0  missing
   3 │ missing  missing    m
   4 │       3        7.2  asd

julia> df[!, :a] = ifelse.(ismissing.(df[!, :a]), df[!, :a], string.(df[!, :a]))
4-element Vector{Union{Missing, String}}:
 "-1"
 missing
 missing
 "3"

julia> df
4×3 DataFrame
 Row │ a        b          c       
     │ String?  Float64?   String?
─────┼─────────────────────────────
   1 │ -1             4.0  k
   2 │ missing        5.0  missing
   3 │ missing  missing    m
   4 │ 3

This is an efficient solution. Did you have any issue with it?

@rmsmsgood Thank You, this solved my issue.

1 Like

@bkamins No, I just wanted to check if I am using right approach. Thanks for your response!

1 Like

@rmsmsgood Thanks for your response, here is what I am used just in case if anyone needs in future.

df.a = passmissing(x->string.(x)).(df.a)

1 Like

The following is cleaner IMO and enough:

julia> df = DataFrame(a=[1, missing, 2])
3×1 DataFrame
 Row │ a
     │ Int64?
─────┼─────────
   1 │       1
   2 │ missing
   3 │       2

julia> passmissing(string).(df.a)
3-element Vector{Union{Missing, String}}:
 "1"
 missing
 "2"

You do not need to broadcast string inside passmissing, as passmissing anyway gets a scalar.

2 Likes

An altenative solution to question 01:

julia> df = DataFrame(a=[-1,missing,missing,3], b=[4,5,missing,7.2], c=["k", missing,"m","asd"])
4×3 DataFrame
 Row │ a        b          c
     │ Int64?   Float64?   String?
─────┼─────────────────────────────
   1 │      -1        4.0  k
   2 │ missing        5.0  missing
   3 │ missing  missing    m
   4 │       3        7.2  asd

# get bool vector of numeric variables
julia> idxs = map(x -> nonmissingtype(x) <: Real, eltype.(eachcol(df)))
3-element Vector{Bool}:
 1
 1
 0

# apply transformation as you requested
julia> transform!(df,
           [col => ByRow(x->coalesce(x, 0)) for col in propertynames(df)[idxs]],
           [col => ByRow(x->coalesce(x, "xyz")) for col in propertynames(df)[.!idxs]],
           renamecols=false)
4×3 DataFrame
 Row │ a      b     c      
     │ Int64  Real  String 
─────┼─────────────────────
   1 │    -1   4.0  k
   2 │     0   5.0  xyz
   3 │     0   0    m
   4 │     3   7.2  asd

You could also change the filter in map to adjust your needs.
For example, you could add in anonymous function the CategoricalValue type from CategoricalArrays.jl

map(
  x -> (nonmissingtype(x) <: AbstractString) || (nonmissingtype(x) <: CategoricalValue), 
  eltype.(eachcol(df)
)
1 Like