transform! changes the type of my variable

I have a mix of strings and missing values, so I have defined my DF column to be Union{String,Missing}

In rare cases, all the values in this column might be missing. In this case, transform! will convert the column to be Missing instead, dropping the ‘Union’.

For example:

using DataFrames
df1 = DataFrame(str=Union{String,Missing}[])
append!(df1.str, ["Hello", "World"])
println("df1 before : ", typeof(df1.str))
transform!(df1, names(df1) .=> ByRow(x -> ismissing(x) ? missing : isa(x, String) ? string(strip(x)) : x) .=> names(df1))
println("df1 after  : ", typeof(df1.str))

df2 = DataFrame(str=Union{String,Missing}[])
append!(df2.str, [missing, missing])
println("df2 before : ", typeof(df2.str))
transform!(df2, names(df2) .=> ByRow(x -> ismissing(x) ? missing : isa(x, String) ? string(strip(x)) : x) .=> names(df2))
println("df2 after  : ", typeof(df2.str))

gives me the following results:

df1 before : Vector{Union{Missing, String}}
df1 after  : Vector{String}
df2 before : Vector{Union{Missing, String}}
df2 after  : Vector{Missing}

I don’t want the type of the variable in my DF to change - it screws up subsequent processing in this rare case of all missing data. How can I avoid this?

Thanks


allowmissing!(df2) restores the types to Union{Missing,String} (with the optional column args restricting the operation to the columns modified by transform!).

See Missing Data · DataFrames.jl

Thank you for this suggestion, @hendri54. However, it doesn’t work in my example. The transform has changed the df2 column to element type Missing, so I need the String part of the union restored, not the Missing part.

In other words, allowmissing! achieves this:

df1 before : Vector{Union{Missing, String}}
df1 after  : Vector{Union{Missing, String}}
df2 before : Vector{Union{Missing, String}}
df2 after  : Vector{Missing}
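The reason can be sketched in Base alone, without DataFrames. allowmissing! widens a column's element type T to Union{T, Missing}, and when T is already Missing that union collapses back to Missing. The variable names below are only for illustration, and the vector is a stand-in for the column:

```julia
# Stand-in for the column after transform!: every value is missing,
# so the element type has been narrowed to Missing.
col = fill(missing, 2)

# Widening with Missing changes nothing here,
# because Union{Missing, Missing} is just Missing.
widened_type = Union{eltype(col), Missing}

println(eltype(col))     # Missing
println(widened_type)    # Missing
```

So there is no String left in the type for allowmissing! to preserve.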

To illustrate my issue further:

df2 = DataFrame(str=Union{String,Missing}[missing, missing])
transform!(df2, names(df2) .=> ByRow(x -> ismissing(x) ? missing : isa(x, String) ? string(strip(x)) : x) .=> names(df2))

df2[1, :str] = "Hello"

The final assignment fails:

ERROR: LoadError: cannot convert a value to missing for assignment
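The same failure can be reproduced on a plain vector, which suggests it is about the column's element type rather than anything specific to DataFrames. A minimal sketch, where v is a stand-in for the column after the transform:

```julia
# Stand-in for df2.str after transform!: a Vector{Missing}.
v = fill(missing, 2)

# A Vector{Missing} cannot store a String, so this assignment throws.
failed = try
    v[1] = "Hello"
    false
catch err
    println("assignment failed: ", sprint(showerror, err))
    true
end

# Converting to the wider element type first makes the assignment legal.
w = convert(Vector{Union{String,Missing}}, v)
w[1] = "Hello"
println(eltype(w))
```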

Works, but not pretty.

foreach(
    x -> df2[!, x] = convert(Vector{Union{String, Missing}}, df2[!, x]),
    names(df2)
)

or

transform!(
    df2,
    names(df2) .=> (x -> convert(Vector{Union{String,Missing}}, x)) .=> names(df2)
)

This is good, but not all my columns are String types. I’ll need to iterate over just those columns that were initially Union{String, Missing} to restore the union. I only need to apply the transform to these anyway.
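One possible way to skip the restore step is to write the results back into the existing vector with broadcast assignment (.=), which keeps the container and its element type. The sketch below uses a Dict of plain vectors as a stand-in for the data frame's columns, and clean is a hypothetical helper name. It assumes the same pattern would be applied to the column vectors themselves (for example, the vector returned by df2.str):

```julia
# Hypothetical helper: trims strings and passes missing through.
clean(x) = ismissing(x) ? missing : string(strip(x))

# Stand-ins for columns of mixed types; names are illustrative only.
cols = Dict(
    "str"  => Union{String,Missing}[missing, missing],
    "name" => Union{String,Missing}[" Hello ", missing],
    "n"    => [1, 2],
)

for (name, col) in cols
    # Only touch columns declared as Union{String,Missing}.
    eltype(col) == Union{String,Missing} || continue
    # In-place update: the vector object and its eltype are unchanged.
    col .= clean.(col)
end

println(eltype(cols["str"]))
```

Because nothing new is allocated for the column, the all-missing case keeps its Union{String,Missing} element type.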

Thank you.

Even if it’s not pretty, this works now:

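df2 = DataFrame(str=Union{String,Missing}[missing, missing])
# Iterate over column names; the element type check is done on each column.
for col in names(df2)
    if eltype(df2[!, col]) == Union{String,Missing}
        transform!(df2, col => ByRow(x -> ismissing(x) ? missing : string(strip(x))) => col)
        # Assign the converted vector back, otherwise the result is discarded.
        df2[!, col] = convert(Vector{Union{String,Missing}}, df2[!, col])
    end
end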

Thanks.


This is definitely a pretty unfortunate issue, and I’m not sure there is a great way to work around it, to be honest.

Interestingly, R infers the output type as a character vector in the same scenario:

r$> df = tibble(str = c(NA, NA))

r$> df2 = mutate(df, str = trimws(str))

r$> df2
# A tibble: 2 × 1
  str  
  <chr>   # <------- See here
1 NA   
2 NA   

I think a Base dev’s opinion might be useful for this. Is there a convenient way to tell Julia to allocate an output vector based on inferred types (if there is type stability, I guess), and not the actual type that is created?

This would be a nice quality-of-life improvement for missings, but I’m not sure how it would all work
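For what it's worth, here is a rough sketch of that idea using Base.promote_op to ask the compiler what a function may return for a given element type. This is an assumption-heavy illustration: Base.promote_op is not a stable public API, inference can return a wider type than expected (even Any), and clean is a hypothetical helper name:

```julia
# Hypothetical helper: trims strings and passes missing through.
clean(x) = ismissing(x) ? missing : string(strip(x))

src = Union{String,Missing}[missing, missing]

# Inferred return type of clean for the source element type.
# The result depends on inference and may be wider than hoped.
T = Base.promote_op(clean, eltype(src))

# Allocate the output from the inferred type, not from the actual values.
out = Vector{T}(undef, length(src))
map!(clean, out, src)

println(eltype(out))
```

A more explicit alternative is to state the element type directly, for example with a typed comprehension such as Union{String,Missing}[clean(x) for x in src].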

OP, you may also find BangBang.jl’s setindex!! function useful. See here.