How do I best add a new variable to an existing DataFrame if I want the new variable to contain only missing values at first? Later I want to assign float values to the new variable.
# Create a simple DataFrame with one variable (:a) and two rows
df = DataFrame(a=[1, 2])
# Add a new variable (:weight) and initialize it to missing (its eltype is Missing, but I want Union{Missing, Float64})
df[:, :weight] .= missing
# I can change a single value for variable :a
df[1, :a] = 3
# I cannot change a missing value to a float value, i.e. I cannot change a single value of :weight
# (the line below throws a MethodError, because a Float64 cannot be converted to Missing)
# df[1, :weight] = 1.0
# Verbose solution: create a new DataFrame with a single column (:weight2) of the correct type
single_col_df = DataFrame(weight2 = Union{Missing, Float64}[missing for i in eachrow(df)])
# Concatenate the two dataframes horizontally
df = hcat(df, single_col_df)
# Now I can change a missing value to a float value
df[1, :weight2] = 1.0
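The assignment to `:weight` fails because broadcasting `missing` into a new column produces a vector whose element type is `Missing`, which cannot hold a `Float64`. A shorter alternative to the `hcat` approach is to build the column with the wider element type from the start. The following is a minimal sketch, assuming a recent DataFrames.jl (1.x) where `df.weight = vector` adds a new column:

```julia
using DataFrames

df = DataFrame(a = [1, 2])

# Build the column with the element type spelled out, filled with `missing`.
df.weight = Vector{Union{Missing, Float64}}(missing, nrow(df))

# The column now accepts a float; the other row stays missing.
df[1, :weight] = 1.0

eltype(df.weight)  # Union{Missing, Float64}
```

Stating the element type explicitly avoids the temporary one-column DataFrame and the horizontal concatenation.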
Yes it’s a bit confusing… This was discussed in this PR and the follow-up one. Relying on this behavior is a bit like using reduce with a non-associative operation, or relying on the order of iteration for Dict keys: it might work in one case, but can fail in another:
julia> module A struct B end end;
julia> similar([1,2,3], Union{Missing, A.B})
3-element Vector{Union{Main.A.B, Missing}}:
Main.A.B()
Main.A.B()
Main.A.B()
This reminds me of an anecdote concerning the Go language: Go maps (the equivalent of Julia's Dict) had an unspecified iteration order, but people would sometimes come to rely on it anyway. So the Go developers ended up adding randomization to the iteration order.
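To avoid depending on whatever `similar` happens to leave in the array, as in the `A.B` example above, the contents can be set explicitly. This is only a sketch of two options that use Base Julia alone; the variable names `n`, `v1` and `v2` are illustrative:

```julia
n = 3

# Option 1: construct the vector already filled with `missing`.
v1 = Vector{Union{Missing, Float64}}(missing, n)

# Option 2: allocate with `similar`, then overwrite every element explicitly.
v2 = fill!(similar([1, 2, 3], Union{Missing, Float64}), missing)

all(ismissing, v1) && all(ismissing, v2)  # true
```

Either way, the result no longer depends on what uninitialized memory looks like for a given `Union` type.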