I have a mix of strings and missing values, so I have defined my DataFrame column with element type Union{String,Missing}.
In rare cases, all the values in this column might be missing. In that case, transform! narrows the column's element type to Missing, dropping the ‘Union’.
df1 before : Vector{Union{Missing, String}}
df1 after : Vector{String}
df2 before : Vector{Union{Missing, String}}
df2 after : Vector{Missing}
I don’t want the type of the variable in my DF to change - it screws up subsequent processing in this rare case of all missing data. How can I avoid this?
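For reference, the narrowing can be reproduced with a sketch like the one below. The column name `str`, the sample values, and the `clean` helper are assumptions chosen for illustration, not the original data.

```julia
using DataFrames

# Hypothetical sample data: df1 holds only strings, df2 holds only missings
df1 = DataFrame(str=Union{String,Missing}[" a ", "b "])
df2 = DataFrame(str=Union{String,Missing}[missing, missing])

# Strip whitespace row by row, passing missings through unchanged
clean(x) = ismissing(x) ? missing : string(strip(x))

for df in (df1, df2)
    before = eltype(df.str)
    # ByRow collects the results into a new vector whose element type
    # depends only on the values actually produced
    transform!(df, :str => ByRow(clean) => :str)
    println(before, " => ", eltype(df.str))
end
```

The declared element type is not carried over because the output column is built from the returned values alone.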
allowmissing!(df2) restores the types to Union{Missing,String} (with the optional column args restricting the operation to the columns modified by transform!).
Thank you for this suggestion @hendri54 . However, it doesn’t work in my example. df2 has been changed to type Missing by the transform. I need the String part of the union restored, not the Missing part.
In other words, allowmissing! achieves this:
df1 before : Vector{Union{Missing, String}}
df1 after : Vector{Union{Missing, String}}
df2 before : Vector{Union{Missing, String}}
df2 after : Vector{Missing}
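A small sketch of why allowmissing! cannot help for df2, assuming the columns already have the narrowed types left behind by the transform (the sample values are made up):

```julia
using DataFrames

# Columns as they look after the transform: the Union has already been dropped
df1 = DataFrame(str=["a", "b"])            # Vector{String}
df2 = DataFrame(str=[missing, missing])    # Vector{Missing}

allowmissing!(df1)
allowmissing!(df2)

eltype(df1.str)  # Union{Missing, String}
eltype(df2.str)  # Missing, because Union{Missing, Missing} is just Missing
```

allowmissing! only adds Missing to whatever element type the column currently has, so it has no way of knowing that String used to be part of the union.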
This is good, but not all my columns are String types. I’ll need to iterate over just those columns that were initially Union{String, Missing} to restore the union. I only need to apply the transform to these anyway.
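using DataFrames

df2 = DataFrame(str=Union{String,Missing}[missing, missing])
for name in names(df2)
    # Only touch columns whose element type is Union{String,Missing}
    if eltype(df2[!, name]) == Union{String,Missing}
        transform!(df2, name => ByRow(x -> ismissing(x) ? missing : string(strip(x))) => name)
        # Assign the converted vector back; convert alone does not modify the column
        df2[!, name] = convert(Vector{Union{String,Missing}}, df2[!, name])
    end
end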
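The same idea can be generalised so it does not hard-code Union{String,Missing}: record each column's element type before the transform and convert back afterwards. This is only a sketch; `with_preserved_eltypes!` is a hypothetical helper name, not part of DataFrames.jl, and it assumes the columns are plain Vectors whose new values are still convertible to the old element type.

```julia
using DataFrames

# Hypothetical helper: run f!(df), then convert each listed column
# back to the element type it had before the call.
function with_preserved_eltypes!(f!, df::AbstractDataFrame, cols=names(df))
    before = Dict(c => eltype(df[!, c]) for c in cols)
    f!(df)
    for (c, T) in before
        if eltype(df[!, c]) != T
            # Throws if the new values cannot be represented as T
            df[!, c] = convert(Vector{T}, df[!, c])
        end
    end
    return df
end

df = DataFrame(str=Union{String,Missing}[missing, missing], n=[1, 2])

with_preserved_eltypes!(df, ["str"]) do d
    transform!(d, :str => ByRow(x -> ismissing(x) ? missing : string(strip(x))) => :str)
end

eltype(df.str)  # Union{Missing, String}
```

Columns that are not listed in `cols` are left exactly as the transform produced them.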
This is definitely a pretty unfortunate issue, and I’m not sure there is a great way to work around it, to be honest.
Interestingly, R infers the output type as a character vector in the same scenario:
r$> df = tibble(str = c(NA, NA))
r$> df2 = mutate(df, str = trimws(str))
r$> df2
# A tibble: 2 × 1
str
<chr> # <------- See here
1 NA
2 NA
I think a Base dev’s opinion might be useful for this. Is there a convenient way to tell Julia to allocate an output vector based on inferred types (if there is type stability, I guess), and not the actual type that is created?
This would be a nice quality-of-life improvement for missings, but I’m not sure how it would all work
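Short of inference-based allocation, one way to pin the output type today is to skip ByRow and pass a whole-column function that builds the result with a typed comprehension. A minimal sketch, assuming the same `str` column as above and a made-up `clean` helper:

```julia
using DataFrames

df2 = DataFrame(str=Union{String,Missing}[missing, missing])

# Strip whitespace, passing missings through unchanged
clean(x) = ismissing(x) ? missing : string(strip(x))

# Whole-column function: the typed comprehension fixes the element type
# up front instead of deriving it from the values produced
transform!(df2, :str => (col -> Union{String,Missing}[clean(x) for x in col]) => :str)

eltype(df2.str)  # Union{Missing, String}
```

The trade-off is that the element type has to be written out by hand for each column rather than inferred.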
OP you may also find BangBang.jl’s setindex!! function useful. See here.